Estimating Demand for Differentiated Products with Zeroes in
Market Share Data∗
Amit Gandhi
UW-Madison
Microsoft
Zhentong Lu
SUFE
Xiaoxia Shi †
UW-Madison
April 18, 2017
Abstract
In this paper we introduce a new approach to estimating differentiated product demand
systems that allows for products with zero sales in the data. Zeroes in demand are a common
problem in product differentiated markets, but fall outside the scope of existing demand estimation
techniques. Our solution to the zeroes problem is based on constructing bounds for
the conditional expectation of the inverse demand. These bounds can be translated into moment
inequalities that are shown to yield a consistent and asymptotically normal point estimator
for demand parameters under natural conditions for differentiated product markets. In Monte
Carlo simulations, we demonstrate that the new approach works well even when the fraction of
zeroes is as high as 95%. We apply our estimator to supermarket scanner data and find price
elasticities become on the order of twice as large when zeroes are properly controlled.
Keywords: Demand Estimation, Differentiated Products, Profile, Measurement Error, Moment Inequality.
JEL: C01, C12, L10, L81.
1 Introduction
In this paper we introduce a new approach to differentiated product demand estimation that allows
for zeroes in empirical market share data. Such zeroes are a highly prevalent feature of demand in
a variety of empirical settings, ranging from workhorse scanner retail data, to data as diverse as
∗Previous version of this paper was circulated under the title “Estimating Demand for Differentiated Products with Error in Market Shares.”
†We are thankful to Steven Berry, Jean-Pierre Dube, Philip Haile, Bruce Hansen, Ulrich Muller, Aviv Nevo, Jack Porter, and Chris Taber for insightful discussions and suggestions. We would also like to thank the participants at the MIT Econometrics of Demand Conference, Chicago-Booth Marketing Lunch, the Northwestern Conference on “Junior Festival on New Developments in Microeconometrics,” the Cowles Foundation Conference on “Structural Empirical Microeconomic Models,” 3rd Cornell - Penn State Econometrics & Industrial Organization Workshop, as well as seminar participants at Wisconsin-Madison, Wisconsin-Milwaukee, Cornell, Indiana, Princeton, NYU, Penn and the Federal Trade Commission for their many helpful comments and questions.
homicide rates and international trade flows (we discuss these examples in further depth below).
Zeroes naturally arise in “big data” applications which allow for increasingly granular views of
consumers, products, and markets (see for example Quan and Williams (2015), Nurski and Verboven
(2016)). Unfortunately, the standard estimation procedures following the seminal Berry, Levinsohn,
and Pakes (1995) (BLP for short) cannot be used in the presence of zero empirical shares - they
are simply not well defined when zeroes are present. Furthermore, ad hoc fixes to market zeroes
that are sometimes used in practice, such as dropping zeroes from the data or replacing them with
small positive numbers, are subject to biases which can be quite large (discussed further below).
This has left empirical work on demand for differentiated products without satisfying solutions to
the zero shares problem, which is the key void our paper aims to fill.
In this paper we provide an approach to estimating differentiated product demand models that
provides consistency (and asymptotic normality) for demand parameters despite a possibly large
presence of market zeroes in the data. We first isolate the econometric problem caused by zeroes
in the data. The problem we show is driven by the wedge between choice probabilities, which
are the theoretical outcome variables predicted by the demand model, and market shares, which
are the empirical revealed preference data used to estimate choice probabilities. Although choice
probabilities are strictly positive in the underlying model, market shares are often zero if choice
probabilities are small. The root of the zeroes problem is that substituting market shares (or
some other consistent estimate) for choice probabilities in the moment conditions that identify the
model, which is the basis for the traditional estimators, will generally lead to asymptotic bias.
While this bias is assumed away in the traditional approach, it cannot be avoided whenever zeroes
are prevalent in the data.
Our solution to this problem is to construct a set of moment inequalities for the model, which
are by design robust to the sampling error in market shares - our moment inequalities will hold at
the true value of the parameters regardless of the magnitude of the measurement error in market
shares for choice probabilities. Despite taking an inequality form, we use these moment inequalities
to form a GMM-type point estimator based on minimizing the deviations from the inequalities.
We show this estimator is consistent so long as there is a positive mass of observations whose
latent choice probabilities are bounded sufficiently away from zero, e.g., products for whom market
shares are not likely to be zero. This is natural in many applications (as illustrated in Section 2),
and strictly generalizes the restrictions on choice probabilities for consistency under the traditional
approach. Asymptotic normality then follows by adapting arguments from censored regression
models by Khan and Tamer (2009).
Computationally, our estimator closely resembles the traditional approach with only a slight
adjustment in how the empirical moments are constructed. In particular it is no more burdensome
than the usual estimation procedures for BLP and can be implemented using either the standard
nested fixed point method of the original BLP, or the MPEC method as advocated more recently
by Dube, Fox, and Su (2012).
We investigate the finite sample performance of the approach in a variety of mixed logit examples.
We find that our estimator works well even when the fraction of zeros is as high as
95%, while the standard procedure with the observations with zeroes deleted yields severely biased
estimators even with mild or moderate fractions of zeroes.
We apply our bounds approach to widely used scanner data from the Dominick’s Finer Foods
(DFF) retail chain. In particular, we estimate demand for the tuna category as previously studied
by Chevalier, Kashyap, and Rossi (2003) and continued by Nevo and Hatzitaskos (2006) in the
context of testing the loss leader hypothesis of retail sales. We find that controlling for products
with zero demand using our approach gives demand estimates that can be more than twice as
elastic as standard estimates that select out the zeroes. We also show that the estimated price
elasticities do not increase during Lent, which is a high demand period for this product category,
after we control for the zeroes. Both of these findings have implications for reconciling the loss-
leader hypothesis with the data.
The plan of the paper is the following. In Section 2, we illustrate the stylized empirical pattern
of Zipf’s law where market zeroes naturally arise. In Section 3, we describe our solution to the
zeroes problem using a simple logit setup without random coefficients to make the essential matters
transparent. In Section 4, we introduce our general approach for discrete choice models with random
coefficients. Sections 5 and 6 present results of Monte Carlo simulations and the application to the
DFF data, respectively. Section 7 concludes.
2 The Empirical Pattern of Market Zeroes
In this section we highlight some empirical patterns that arise in applications where the zero shares
problem arises, which will also help to motivate the general approach we take to it in the paper.
Here we will primarily use workhorse store level scanner data to illustrate these patterns. It is
this same data that will also be used for our empirical application. However we emphasize that
our focus here on scanner data is only for the sake of a concrete illustration of the market zeroes
problem - the key patterns we highlight in scanner data are also present in many other economic
settings where demand estimation techniques are used (discussed further below and illustrated in
the Appendix).
We employ here a widely studied store level scanner data set from the Dominick’s Finer Foods
grocery chain, which is public data that has been used by many researchers.1 The data comprises
93 Dominick’s Finer Foods stores in the Chicago metropolitan area over the years from 1989 to
1997. Like other store level scanner data sets, this data set provides demand information (price,
sales, marketing) at a store/week/UPC level, where a UPC (universal product code) is a unique
1For a complete list of papers using this data set, see the website of Dominick’s Database: http://research.chicagobooth.edu/marketing/databases/dominicks/index.aspx
bar code that identifies a product2.
Table 1 presents information on the resulting product variety across the different product categories
in the data. The first column shows the number of products in an average store/week - the
number of UPC’s can be seen varying from roughly 50 (e.g., bath tissue) to over four hundred
(e.g., soft drinks) within even these fairly narrowly defined categories. Thus there is considerable
product variety in the data. The next two columns illustrate an important aspect of this large
product variety: there are often just a few UPC’s that dominate each product category whereas
most UPC’s are not frequently chosen. The second column illustrates this pattern by showing the
well known “80/20” rule prevails in our data: we see that roughly 80 percent of the total quantity
purchased in each category is driven by the top 20 percent of the UPC’s in the category. In contrast
to these “top sellers”, the other 80 percent of UPC’s contain relatively “sparse sellers” that
share the remaining 20 percent of the total volume in the category. The third column shows an
important consequence of this sparsity: many UPC’s in a given week at a store simply do not sell.
In particular, we see that the fraction of observations with zero sales can even be nearly 60% for
some categories.
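The two summary statistics reported in Table 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `concentration_stats`, the pooling of volume across store/weeks before ranking UPC's, and the toy sales matrix are all assumptions made for the example.

```python
import numpy as np

def concentration_stats(units):
    """units: 2-D array, rows = store/week observations, columns = UPCs (assumed layout)."""
    units = np.asarray(units, dtype=float)
    totals = units.sum(axis=0)                  # total volume per UPC, pooled over rows
    n_top = max(1, (totals.size + 4) // 5)      # number of UPCs in the top 20% (rounded up)
    top = np.sort(totals)[::-1][:n_top]         # volumes of the best-selling UPCs
    top20_share = top.sum() / totals.sum()      # share of volume due to the top 20%
    zero_fraction = float(np.mean(units == 0))  # share of (store/week, UPC) cells with no sale
    return top20_share, zero_fraction

# Hypothetical toy data: 4 store/weeks by 5 UPCs.
toy = np.array([[40, 5, 0, 1, 0],
                [35, 6, 1, 0, 0],
                [50, 4, 0, 0, 2],
                [45, 5, 0, 0, 0]])
top20_share, zero_fraction = concentration_stats(toy)
print(round(top20_share, 4), zero_fraction)
```

In the toy data a single UPC carries most of the volume while almost half of the cells are zero, which is the qualitative pattern the table documents.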
Table 1: Selected Product Categories in the Dominick’s Database
Category              Average Number of UPC’s    Percent of Total Sale    Percent of
                      in a Store/Week Pair       of the Top 20% UPC’s     Zero Sales
Beer                  179                        87.18%                   50.45%
Cereals               212                        72.08%                   27.14%
Crackers              112                        81.63%                   37.33%
Dish Detergent        115                        69.04%                   42.39%
Frozen Dinners        123                        66.53%                   38.32%
Frozen Juices          94                        75.16%                   23.54%
Laundry Detergents    200                        65.52%                   50.46%
Paper Towels           56                        83.56%                   48.27%
Refrigerated Juices    91                        83.18%                   27.83%
Soft Drinks           537                        91.21%                   38.54%
Snack Crackers        166                        76.39%                   34.53%
Soaps                 140                        77.26%                   44.39%
Toothbrushes          137                        73.69%                   58.63%
Canned Tuna           118                        82.74%                   35.34%
Bathroom Tissues       50                        84.06%                   28.14%
We can visualize this situation another way by fixing a product category (here we use canned
2Store level scanner data can often be augmented with a panel of household level purchases (available, for example, through IRI or Nielsen). Although the DFF data do not contain this micro level data, the main points of our analysis are equally applicable to the case where household level data is available. In fact our general choice model will accommodate the possibility of micro data. Store level purchase data is actually a special case of household level data where all households are observationally identical (no observable individual level characteristics).
Figure 1: Zipf’s Law in Scanner Data
tuna) and simply plotting the histogram of the volume sold for each week/UPC realization for a
single store in the data. This frequency plot is given in Figure 1. As can be seen, there is a sharp
decay in the empirical frequency as the purchase quantity becomes larger, with a long thin tail. In
particular the bulk of UPC’s in the store have small purchase volume: the median UPC sells less
than 10 units a week, which is less than 1.5% of the median volume of Tuna the store sells in a
week. The mode of the frequency plot is a zero share.
This power-law decay in the frequency of product demand is often associated with “Zipf’s
law” or “the long tail”, which has a long history in empirical economics.3 We present further
illustrations of this long-tail demand pattern found in international trade flows as well as cross-
county homicide rates in Appendix A, which provides a sense of the generality of these stylized
facts.
The key takeaway from these illustrations is that the presence of market zeroes in the data is
closely intertwined with the prevalence of power-law patterns of demand. We will later exploit this
relationship to place structure on the data generating process that underlies market zeroes.
3 A First Pass Through Logit Demand
Why do zero shares create a problem for demand estimation? In this section, we use the workhorse
multinomial logit model to explain the zeroes problem and introduce our new estimation strategy.
Formal treatment for general differentiated product demand models is given in the next section.
3See Anderson (2006) for a historical summary of Zipf’s law and many examples from the social and natural sciences. See Gabaix (1999a) for an application of Zipf’s law to the economics literature.
3.1 Zeroes Problem in the Logit Model
Consider a multinomial logit model for the demand of J products (j = 1, . . . , J) and an outside
option (j = 0). A consumer i derives utility uijt = δjt+ εijt from product j in market t, where δjt is
the mean-utility of product j in market t, and εijt is the idiosyncratic taste shock that follows the
type-I extreme value distribution. As is standard, the mean-utility δjt of product j > 0 is modeled
as
$$\delta_{jt} = X_{jt}'\beta + \xi_{jt}, \tag{3.1}$$
where Xjt is the vector of observable (product, market) characteristics, often including price, and
ξjt is the unobserved characteristic. The outside good j = 0 has mean utility normalized to δ0t = 0.
The parameter of interest is β.
Each consumer chooses the product that yields the highest utility. Aggregating consumers’
choices, we obtain the true choice probability of product j in market t, denoted as
πjt = Pr(product j is chosen in market t).
The standard approach introduced by Berry (1994) for estimating β is to combine demand system
inversion and instrumental variables.
First, for demand inversion, one uses the logit structure to find that
δjt = ln (πjt)− ln (π0t) , for j = 1, . . . , J. (3.2)
Then, to handle the potential endogeneity of Xjt (correlation with ξjt), one finds a random vector
zjt, such that
E [ξjt| zjt] = 0. (3.3)
Then two stage least squares with δjt defined in terms of choice probabilities as the dependent
variable becomes the identification strategy for β.
Unfortunately πjt is not observed as data - it is a theoretical choice probability defined by the
model but only indirectly revealed through actual consumer choices. The standard approach to this
following Berry (1994), Berry, Levinsohn, and Pakes (1995), and many subsequent papers in the
literature has been to substitute sjt, the empirical market share of product j in market t based on
the choices of n potential consumers, for πjt, and run two-stage least squares with ln (sjt)− ln (s0t)
as the dependent variable, xjt as covariates, and zjt as instruments to obtain estimates for β.
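The plug-in procedure just described can be sketched in a few lines. This is an illustrative sketch only: the function names `berry_inversion` and `two_stage_least_squares` are hypothetical, the toy data set ξjt = 0 and uses the covariates as their own instruments, and no standard errors are computed.

```python
import numpy as np

def berry_inversion(shares, outside_share):
    """Logit inversion (3.2) applied to shares; undefined if any share is zero."""
    shares = np.asarray(shares, dtype=float)
    if np.any(shares <= 0) or np.any(np.asarray(outside_share) <= 0):
        raise ValueError("log of a zero share is undefined")
    return np.log(shares) - np.log(outside_share)

def two_stage_least_squares(y, X, Z):
    """2SLS: project X on the instruments Z, then regress y on the projection."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Toy market with three products, xi = 0, and exact choice probabilities.
beta_true = np.array([-1.0, -2.0])
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
delta = X @ beta_true
pi = np.exp(delta) / (1.0 + np.exp(delta).sum())
pi0 = 1.0 / (1.0 + np.exp(delta).sum())

y = berry_inversion(pi, pi0)
beta_hat = two_stage_least_squares(y, X, X)   # instruments equal covariates in this toy case
print(beta_hat)
```

With exact choice probabilities the inversion recovers δjt and hence β; the function raises an error as soon as one share is zero, which is the failure the rest of this section is about.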
Plugging in the estimate sjt for πjt appears innocuous at first glance because the number of
potential consumers (n) in a market from which sjt is constructed is typically large. Nevertheless
problems arise when there are (jt)’s for which πjt is very small. Because the slope of the natural
logarithm function approaches infinity when the argument approaches zero, even small estimation
error of πjt may lead to large error in the plugged-in version of δjt when πjt is very small. In particular,
sjt may frequently equal zero in this case, causing the demand inversion to fail completely. The
first is the theoretical root of the small πjt problem, while the second is an unmistakable symptom.
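Both the root and the symptom can be quantified with simple arithmetic. The sketch below assumes, for illustration only, that the n choices are independent draws with a common choice probability; the function names are hypothetical.

```python
import math

def log_share_jump(n, k):
    """Change in ln(s) when the sales count moves from k to k + 1 out of n consumers."""
    return math.log((k + 1) / n) - math.log(k / n)

def prob_zero_sales(n, pi):
    """P(s = 0) when n consumers choose independently with choice probability pi."""
    return (1.0 - pi) ** n

n = 10_000
# A thick product (about 30% share): one extra sale barely moves ln(s).
thick_jump = log_share_jump(n, 3000)
# A thin product (one expected sale): one extra sale moves ln(s) by ln(2).
thin_jump = log_share_jump(n, 1)
# The thin product has a zero share in a sizeable fraction of markets.
thin_zero_prob = prob_zero_sales(n, 1e-4)
print(thick_jump, thin_jump, thin_zero_prob)
```

A single additional purchase shifts the plugged-in mean utility of the thin product by ln 2, about 0.69, and the same product records zero sales with probability close to exp(−1), even though n is large.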
Data sets with this symptom are frequently encountered in empirical research as discussed in
Section 2. With such data, a common practice is to ignore the (jt)’s with sjt = 0, effectively
lumping those j’s into the outside option in market t. This leads however to a selection problem.
To see this, suppose sjt = 0 for some (j, t) and one drops these observations from the analysis
- effectively one is using a selected sample where the selection criterion is sjt > 0. In this selected
sample, the conditional mean of ξjt is no longer zero, i.e.,
$$E[\xi_{jt} \mid x_{jt}, s_{jt} > 0] \neq 0. \tag{3.4}$$
This is the well-known selection-on-unobservables problem and with such sample selection, an
attenuation bias ensues.4 The attenuation bias generally leads to demand estimates that appear
to be too inelastic.5
Another commonly adopted empirical “trick” is to add a small positive number ε > 0 to the
sjt’s that are zero, and use the resulting modified shares sεjt > 0 in place of πjt.6 However, this
trick only treats the symptom, i.e., sjt = 0, but overlooks the nature of the problem: the true choice
probability πjt is small. And in this case, small estimation error in any estimator π̂jt of πjt would
lead to large error in the plugged-in version of δjt and the estimation of β. This problem manifests
itself directly because the estimate β can be incredibly sensitive to the particular choice of the small
number being added with little guidance on what is the “right” choice of small number. In general,
like selecting away the zeroes, the “adding a small number trick” also yields a biased estimator for β.
We illustrate both biases in the Monte Carlo section (Section 5).
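The sensitivity to the added constant is easy to see for a single zero-share observation in the logit model. The following sketch is purely illustrative; `patched_mean_utility` and the outside share of 0.5 are assumptions made for the example.

```python
import math

def patched_mean_utility(share, outside_share, eps):
    """Logit inversion after replacing a zero share by eps (the ad hoc fix)."""
    patched = share if share > 0 else eps
    return math.log(patched) - math.log(outside_share)

# The same zero-share observation under three arbitrary choices of eps.
values = {eps: patched_mean_utility(0.0, 0.5, eps) for eps in (1e-3, 1e-6, 1e-9)}
for eps, delta in values.items():
    print(eps, delta)
```

Each thousandfold reduction in the added constant lowers the imputed mean utility by ln(1000), roughly 6.9, so the dependent variable of the second-stage regression is driven by a choice for which the model offers no guidance.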
Despite their failure as general solutions, these “ad hoc zero fixes” have in them what could be a
useful idea: perhaps the variation among the non-zero share observations can be used to estimate
the model parameters, while at the same time the presence of zeroes is controlled in such a way
that avoids bias. We now present a new estimator that formalizes this possibility by using moment
inequalities to control for the zeroes in the data while using the variation in the remaining part of
the data to consistently estimate the demand parameters. We continue in this section to illustrate
our approach within the logit model before treatment of the general case in the next section.
4In fact,
$$E[\xi_{jt} \mid x_{jt}, s_{jt} > 0] > 0 \tag{3.5}$$
in the homoskedastic case. This is because the criterion sjt > 0 selects high values of ξjt and leaves out low values of ξjt.
5It is easy to see that the selection bias is of the same direction if the selection criterion is instead sjt > 0 for all t, as one is effectively doing when focusing on a few top sellers that never demonstrate zero sales in the data. The reason is that the event sjt > 0 for all t contains the event sjt > 0 for a particular t. If the markets are weakly dependent, the particular t part of the selection dominates.
6Berry, Linton, and Pakes (2004) and Freyberger (2015) study the biasing effect of plugging in sjt for πjt. Their bias corrections do not apply when there are zeroes in the empirical shares.
3.2 A Bounds Estimator
Our bounds approach turns the selection-on-unobservable problem into a selection-on-observable
strategy, with the key features that the selection is not based on market share but on exogenous variables,
and is not determined ex-ante by the econometrician but rather automatically performed by
the estimator. Specifically, we assume that there exists a set of “safe product/markets” (j, t), identified
by the instrumental variable zjt, with inherently thick demand such that sjt has a small chance
of being zero. In particular, we assume a partition on the support of zjt: supp(zjt) = Z = Z0 ∪Z1
that separates the safe product/markets (zjt ∈ Z0) from the remaining “risky product/markets”
(zjt ∈ Z1). 7 The safe products have inherently desirable characteristics that often make them the
“top sellers” described in Section 2, while the risky products have less attractive characteristics that
often yield sparse demand. If we knew Z0 and focused on the observations such that zjt ∈ Z0, the
standard estimator would be consistent. The key challenge in the data is that the econometrician
will not know Z0 in advance. Our bounds estimator automatically utilizes the variation in Z0,
but at the same time safely controls for the observations in Z1, to consistently estimate β without
requiring the researcher either to know or to estimate the underlying partition (Z0,Z1).
Our approach first uses two mean-utility estimators, $\delta^u_{jt}$ and $\delta^\ell_{jt}$, that are functions of empirical
market shares (rather than the true choice probability), to form bounds on E [δjt| zjt]:
$$E[\delta^u_{jt} \mid z_{jt}] \ge E[\delta_{jt} \mid z_{jt}] \ge E[\delta^\ell_{jt} \mid z_{jt}], \quad \forall j, t \ \text{a.s.} \tag{3.6}$$
where δjt is the true mean-utility in (3.1). Next, the inequalities (3.6) combined with (3.3) imply
$$E[\delta^u_{jt} - x_{jt}'\beta \mid z_{jt}] \ge 0 \ge E[\delta^\ell_{jt} - x_{jt}'\beta \mid z_{jt}] \quad \text{a.s.} \tag{3.7}$$
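For completeness, the step from (3.6) to (3.7) can be written out as follows. This is a reader's sketch, evaluated at the true parameter value and using only the linearity of conditional expectations, the mean-utility equation (3.1), and the instrument condition (3.3):

```latex
\begin{align*}
E[\delta^u_{jt} - x_{jt}'\beta \mid z_{jt}]
  &\ge E[\delta_{jt} - x_{jt}'\beta \mid z_{jt}]
   = E[\xi_{jt} \mid z_{jt}] = 0, \\
E[\delta^\ell_{jt} - x_{jt}'\beta \mid z_{jt}]
  &\le E[\delta_{jt} - x_{jt}'\beta \mid z_{jt}]
   = E[\xi_{jt} \mid z_{jt}] = 0.
\end{align*}
```

The inequalities come from subtracting $E[x_{jt}'\beta \mid z_{jt}]$ from every term in (3.6); the equalities substitute (3.1) and then apply (3.3).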
Observe that the moment restriction (3.3) implies that
$$E[(\delta_{jt} - x_{jt}'\beta)\, g(z_{jt})] = 0 \quad \forall g \in G,$$
where G is a set of instrumental variable functions. Using instead our upper and lower mean utility
estimators in place of the true mean utility we have the following moment inequalities
$$E[(\delta^u_{jt} - x_{jt}'\beta)\, g(z_{jt})] \ge 0 \ge E[(\delta^\ell_{jt} - x_{jt}'\beta)\, g(z_{jt})] \quad \forall g \in G. \tag{3.8}$$
Following Andrews and Shi (2013), we take each g ∈ G to be an indicator function for a hypercube
Bg ⊆ supp (z), i.e.,
g(zjt) = 1 (zjt ∈ Bg) ,
and as long as G is rich enough, identification information in (3.7) is preserved by the moment
inequalities (3.8).
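To make the indicator instruments concrete, the sketch below builds a finite collection of interval indicators for a scalar instrument. It is a simplified, hypothetical construction: the function name `interval_indicators`, the rescaling of z to (0, 1], the cell widths 1/(2r), and the truncation at `max_r` are assumptions of the example rather than the exact collection used in the paper.

```python
import numpy as np

def interval_indicators(z, max_r=3):
    """Indicators of the cells ((a-1)/(2r), a/(2r)] for r = 1..max_r, a = 1..2r.

    z is a 1-D instrument assumed to be rescaled to (0, 1].
    Returns an (M, N) 0/1 matrix with one row per cell and one column per observation.
    """
    z = np.asarray(z, dtype=float)
    rows = []
    for r in range(1, max_r + 1):
        edges = np.arange(0, 2 * r + 1) / (2 * r)   # cell boundaries at resolution r
        for a in range(1, 2 * r + 1):
            rows.append(((z > edges[a - 1]) & (z <= edges[a])).astype(float))
    return np.vstack(rows)

z = np.array([0.1, 0.3, 0.5, 0.75, 1.0])
G_mat = interval_indicators(z)
print(G_mat.shape)
```

At each resolution r the cells partition the unit interval, so every observation activates exactly one indicator per resolution; finer cells let the moment conditions localize to regions of the instrument space such as the safe set.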
7We will formalize the requirement on the partition in Section 4.
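To form our estimator, define
$$\rho^u_T(\beta, g) = (TJ)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J} (\delta^u_{jt} - x_{jt}'\beta)\, g(z_{jt}),$$
$$\rho^\ell_T(\beta, g) = (TJ)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J} (x_{jt}'\beta - \delta^\ell_{jt})\, g(z_{jt}).$$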
Let $[a]_-$ denote $|\min\{0, a\}|$. Our estimator is then
$$\beta_{BD} = \arg\min_{\beta} \sum_{g \in G} \mu(g) \left\{ [\rho^u_T(\beta, g)]_-^2 + [\rho^\ell_T(\beta, g)]_-^2 \right\}, \tag{3.9}$$
where µ(g) is a probability density function on G, that is µ(g) > 0 for all g ∈ G, and $\sum_{g \in G} \mu(g) = 1$.
The function µ(g) is used to ensure summability of the terms, and the choice of µ(·) is discussed
in the next section.
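The sample criterion in (3.9) can be sketched directly from these definitions. The code below is a minimal illustration under stated assumptions: observations are stacked into N = TJ rows, the instrument functions are supplied as a precomputed 0/1 matrix, and the names `neg_part` and `bounds_objective` as well as the toy bounds of width 0.2 are hypothetical.

```python
import numpy as np

def neg_part(a):
    """[a]_- = |min{0, a}|, applied elementwise."""
    return np.abs(np.minimum(0.0, a))

def bounds_objective(beta, delta_u, delta_l, X, G, mu):
    """Sample criterion of (3.9) for stacked data.

    delta_u, delta_l : (N,) upper and lower mean-utility estimates
    X                : (N, K) covariates
    G                : (M, N) matrix with G[m, i] = g_m(z_i)
    mu               : (M,) positive weights summing to one
    """
    fitted = X @ beta
    rho_u = G @ (delta_u - fitted) / len(fitted)   # should be >= 0 at the truth
    rho_l = G @ (fitted - delta_l) / len(fitted)   # should be >= 0 at the truth
    return float(np.sum(mu * (neg_part(rho_u) ** 2 + neg_part(rho_l) ** 2)))

# Toy check: bounds that bracket X @ beta0 give a zero criterion at beta0.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
beta0 = np.array([0.5, -1.0])
delta_u = X @ beta0 + 0.1
delta_l = X @ beta0 - 0.1
G = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
mu = np.array([0.5, 0.5])

q_true = bounds_objective(beta0, delta_u, delta_l, X, G, mu)
q_off = bounds_objective(beta0 + np.array([1.0, 0.0]), delta_u, delta_l, X, G, mu)
print(q_true, q_off)
```

Only violated inequalities contribute to the criterion, so any β that satisfies all sample inequalities attains the value zero; a numerical minimizer over β would complete the estimator.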
Why is βBD consistent? A heuristic proof is as follows. Let us define the partition G = G0 ∪ G1
where each g ∈ G0 has support inside Z0. This partition does not need to be explicitly formed
by the econometrician (only the flexible set of instrumental variable functions G over the entire
support of zjt in the observed data is needed as an input), but only needs to exist in the underlying
DGP. We can then separate the objective function underlying (3.9) into two additive pieces
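$$\sum_{g \in G_0} \mu(g) \left\{ [\rho^u_T(\beta, g)]_-^2 + [\rho^\ell_T(\beta, g)]_-^2 \right\} + \sum_{g \in G_1} \mu(g) \left\{ [\rho^u_T(\beta, g)]_-^2 + [\rho^\ell_T(\beta, g)]_-^2 \right\}. \tag{3.10}$$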
Notice that at the true parameter value β0, each of these sums in (3.10) converges in probability
to 0 because of the validity of the moment inequalities (3.8) at the true value β0. What happens
away from the true value, at some β∗ ≠ β0? Observe that the second sum over G1 is by construction
nonnegative regardless of the value of β. The first sum, on the other hand, approaches the
µ-weighted sum over g ∈ G0 of the squared moments,
$$\sum_{g \in G_0} \mu(g) \left( E[(\delta_{jt} - x_{jt}'\beta^*)\, g(z_{jt})] \right)^2,$$
because the bounds are asymptotically tight for g ∈ G0 (that is, for products whose zjt lies
in the safe set Z0), so that $\rho^u_T(\beta^*, g)$ and $-\rho^\ell_T(\beta^*, g)$ share the same limit as T → ∞.
Then so long as the instruments $\{g(z_{jt})\}_{g \in G_0}$ have sufficient variation for the IV
rank condition with xjt to hold (the standard logit identifying condition), we are ensured that for
at least a positive mass of g ∈ G0 we have that
$$E[(\delta_{jt} - x_{jt}'\beta^*)\, g(z_{jt})] \neq 0.$$
Thus the first sum in (3.10) will converge in probability to a strictly positive number. Hence
the limiting value of the objective function (3.9) attains a minimum at the true value β0 and thus
by standard arguments βBD →p β0.
Figure 2 provides a graphical illustration of the above arguments. In the safe products region Z0,
the bounds are tight and provide identification power, while in Z1, the bounds may be uninformative
but still valid. So instrumental functions such as g1 ∈ G0 will form moment equalities that point
identify the model. Other instrumental functions, such as g2, g3 ∈ G1, are associated with slack
moment inequalities so they do not undermine the identification.
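Figure 2: Illustration of Bounds Approach
[Figure: the conditional means E[δ^u_jt | Zjt], E[δjt | Zjt], and E[δ^ℓ_jt | Zjt] plotted against Zjt over the regions Z0 and Z1, with the instrumental functions g1, g2, and g3 marked.]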
The bounds estimator thus controls for the zeroes in the data while using the variation among
the safe products to consistently estimate the model parameters. We now generalize this logic and
formalize it to the general differentiated product demand context with general error distribution
for the random utility model. We will show both consistency and asymptotic normality of the
estimator in this general case.
4 The General Model and Estimator
The researcher has data on a sample of markets t = 1, . . . , T, and for each market t, there is a sample
of individuals i = 1, . . . , nt choosing from the j = 0, . . . , Jt products in the market. A product j in
market t is characterized by a vector of characteristics $x_{jt} \in \mathbb{R}^{d_x}$ that are observed by the researcher,
and a scalar unobserved product attribute ξjt. We will refer to the bundle (xjt, ξjt) as j’s product
characteristics (observed and unobserved). Note that to better match the feature of popular data
sets, we allow a t subscript for J, that is, different markets can have different numbers of products.
We will also allow a t subscript for n, the number of potential consumers.
In discrete choice models each consumer i = 1, . . . , nt in market t is assumed to make a single
choice from the product varieties j = 0, . . . , Jt in the market, where j = 0 denotes the outside
option of not purchasing. This choice is determined by maximizing a utility function that is random
from the perspective of the researcher. Specifically, the utility consumer i derives from consuming
product j in market t is given by
uijt = δjt + εijt,
where:
1. δjt is the mean-utility of product j in market t. Normalize δ0t = 0. As is standard, δjt is
modeled as
δjt = x′jtβ + ξjt, (4.1)
where xjt is the vector of observable (product, market) characteristics, often including price,
and ξjt is the scalar unobserved characteristic;
2. εijt is the idiosyncratic taste shock governed by the following distribution,
$$\varepsilon_{it} = (\varepsilon_{i0t}, \dots, \varepsilon_{iJ_t t}) \sim F(\cdot \mid x_t; \lambda), \tag{4.2}$$
where xt stands for $(x_{1t}', \dots, x_{J_t t}')'$, and F (·|xt, λ) is a conditional cumulative distribution
function known up to the finite dimensional unknown parameter λ. Thus, the unknown
parameter in the model is θ = (β′, λ′)′. For clarity, we use θ0 ≡ (β′0, λ′0)′ to denote the true
value of the unknown parameter.
It is worth noting that allowing xt and the parameter λ to enter F makes this specification encompass
random coefficient specifications uijt = x′jtβi + ξjt, where βi follows some distribution (e.g.,
joint normal), because one can then view β as the mean of the random coefficients and εijt as the
sum of the products of the de-meaned random coefficients and the product characteristic xjt.8
We assume consumers demand the product that maximizes utility. Thus integrating out εit
yields a system of choice probabilities for agents in the market
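$$\sigma(\delta_t, x_t, \lambda) \equiv (\sigma_1(\delta_t, x_t, \lambda), \dots, \sigma_{J_t}(\delta_t, x_t, \lambda))',$$
where $\delta_t = (\delta_{1t}, \dots, \delta_{J_t t})'$. Then we obtain the demand system
$$\pi_t \equiv (\pi_{1t}, \dots, \pi_{J_t t})' = \sigma(\delta_t, x_t, \lambda), \tag{4.3}$$
where πjt = Pr(product j is chosen in market t) represents the true choice probability of product
j in market t. Let $\sigma^{-1}(\pi_t, x_t, \lambda) \equiv (\sigma_1^{-1}(\pi_t, x_t, \lambda), \dots, \sigma_{J_t}^{-1}(\pi_t, x_t, \lambda))'$ denote the inverse demand
function such that:
$$\delta_t = \sigma^{-1}(\pi_t, x_t, \lambda). \tag{4.4}$$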
8Requiring F (·|xt, λ) to be known up to a finite dimensional parameter rules out the vertical model because for the vertical model, εit is a function of the unobservable product characteristics (quality).
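Note that in the simple logit model, $\sigma_j(\delta_t, x_t, \lambda)$ reduces to
$$\sigma_j(\delta_t) = \frac{\exp(\delta_{jt})}{1 + \sum_{j'=1}^{J_t} \exp(\delta_{j't})},$$
and $\sigma_j^{-1}(\pi_t, x_t, \lambda)$ reduces to $\sigma_j^{-1}(\pi_t) = \ln(\pi_{jt}) - \ln(\pi_{0t})$.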
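The logit demand system and its inverse can be checked numerically with a short round trip. The function names `logit_shares` and `logit_inverse` and the example mean utilities are assumptions of this sketch.

```python
import numpy as np

def logit_shares(delta):
    """Inside-good choice probabilities and the outside-good probability (delta_0 = 0)."""
    expd = np.exp(np.asarray(delta, dtype=float))
    denom = 1.0 + expd.sum()
    return expd / denom, 1.0 / denom

def logit_inverse(pi, pi0):
    """Inverse demand in the logit case: ln(pi_j) - ln(pi_0)."""
    return np.log(pi) - np.log(pi0)

delta = np.array([1.0, -2.0, 0.3])
pi, pi0 = logit_shares(delta)
delta_back = logit_inverse(pi, pi0)
print(pi, pi0, delta_back)
```

The inverse is well defined here only because every choice probability is strictly positive, which is exactly what fails when an empirical share of zero is substituted for πjt.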
Inverting the demand system allows for the use of instrumental variables to identify θ. In
particular, instruments for the model are a random vector zjt that satisfies
E [ξjt |zjt ] = 0. (4.5)
Combining (4.4) and (4.5), the model yields the following moment restriction:
$$E[\sigma_j^{-1}(\pi_t, x_t, \lambda) - x_{jt}'\beta \mid z_{jt}] = 0. \tag{4.6}$$
If πt is observed, identification can be stated as follows. The model is identified if and only if for
any θ = (β, λ) ≠ θ0,
$$\Pr\nolimits_F\left( m^*_F(\theta, z_{jt}) \neq 0 \ \text{and} \ z_{jt} \in Z = \mathrm{supp}(z_{jt}) \right) > 0,$$
where
$$m^*_F(\theta, z_{jt}) = E[\sigma_j^{-1}(\pi_t, x_t; \lambda) - x_{jt}'\beta \mid z_{jt}]. \tag{4.7}$$
Primitive conditions for identification are given in Berry and Haile (2014).
4.1 Bounds Estimator in the General Case
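Like in the logit case, we construct a pair of inverse demand functions, $\delta^u_{jt}(\lambda)$ and $\delta^\ell_{jt}(\lambda)$, to form
bounds on $E[\sigma_j^{-1}(\pi_t, x_t, \lambda) \mid z_{jt}]$, i.e.,
$$E[\delta^u_{jt}(\lambda) \mid z_{jt}] \ge E[\sigma_j^{-1}(\pi_t, x_t, \lambda) \mid z_{jt}] \ge E[\delta^\ell_{jt}(\lambda) \mid z_{jt}], \quad \text{a.s.} \tag{4.8}$$
These inequalities combined with (4.5) form the moment inequalities that our estimation of θ is
based upon:
$$E[\delta^u_{jt}(\lambda) - x_{jt}'\beta \mid z_{jt}] \ge 0 \ge E[\delta^\ell_{jt}(\lambda) - x_{jt}'\beta \mid z_{jt}], \quad \text{a.s.} \tag{4.9}$$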
To construct these upper and lower mean utility estimates $\delta^\ell_{jt}(\lambda)$, $\delta^u_{jt}(\lambda)$, we start by applying
the Laplace rule of succession to obtain an initial choice probability estimator that does not have
zeros: $\tilde{s}_{jt} = \frac{n_t s_{jt} + 1}{n_t + J_t + 1}$.9 We call this the Laplace share estimator. It is a good estimator for the
choice probabilities when the prior information is only that these probabilities should be positive,
as argued in Jaynes (2003, Chap. 18), and thus provides a good starting point for our construction.
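A minimal sketch of the Laplace share computation follows. It assumes, for illustration, that the same formula is applied to the outside option as to the inside goods, so that the J + 1 adjusted shares sum to one; the function name `laplace_shares` and the toy counts are hypothetical.

```python
import numpy as np

def laplace_shares(counts):
    """Laplace shares (n*s + 1) / (n + J + 1).

    counts[0] is the number of consumers choosing the outside option and
    counts[1:] the purchase counts of the J inside products (assumed layout).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()          # number of potential consumers
    J = counts.size - 1       # number of inside products
    return (counts + 1.0) / (n + J + 1.0)

# Ten consumers, two inside products, one of which records no sale.
s_tilde = laplace_shares([7, 3, 0])
print(s_tilde)
```

The product with zero sales receives the share 1/(n + J + 1), the smallest value the estimator can take, which is why a perturbation ηt below this level keeps all perturbed shares positive.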
9The Laplace rule of succession was proposed by Pierre-Simon Laplace in the early 19th century to predict the probability of an event happening given n independent past observations and the prior knowledge that the probability must be strictly between 0 and 1. It is a concept fundamental to modern probability theory despite being widely misunderstood and criticized. See Jaynes (2003, Chap. 18) for a thorough discussion.
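We do not use the Laplace share estimator directly in place of πt, but use it to construct bounds
$\delta^u_{jt}(\lambda)$ and $\delta^\ell_{jt}(\lambda)$. Specifically, writing $\tilde{s}_t$ for the vector of Laplace shares, we define
$$\delta^u_{jt}(\lambda) = \Delta_{jt}(\tilde{s}_t, x_t; \lambda) + \log\left( \frac{\tilde{s}_{jt} + \eta_t}{\tilde{s}_{0t} - \eta_t} \right) \tag{4.10}$$
$$\delta^\ell_{jt}(\lambda) = \Delta_{jt}(\tilde{s}_t, x_t; \lambda) + \log\left( \frac{\tilde{s}_{jt} - \eta_t}{\tilde{s}_{0t} + \eta_t} \right), \tag{4.11}$$
where
$$\Delta_{jt}(\tilde{s}_t, x_t; \lambda) \equiv \sigma_j^{-1}(\tilde{s}_t, x_t; \lambda) - \log\left( \frac{\tilde{s}_{jt}}{\tilde{s}_{0t}} \right), \tag{4.12}$$
and ηt is a scalar in (0, 1/(nt + Jt + 1)).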
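It is instructive to consider the simple logit case, where $\sigma_j^{-1}(\tilde{s}_t, x_t; \lambda) = \log(\tilde{s}_{jt} / \tilde{s}_{0t})$ with $\tilde{s}_t$ the Laplace shares, the term
$\Delta_{jt}(\tilde{s}_t, x_t; \lambda) = 0$, so the bounds boil down to
$$\delta^u_{jt} = \log\left( \frac{\tilde{s}_{jt} + \eta_t}{\tilde{s}_{0t} - \eta_t} \right) \quad \text{and} \quad \delta^\ell_{jt} = \log\left( \frac{\tilde{s}_{jt} - \eta_t}{\tilde{s}_{0t} + \eta_t} \right). \tag{4.13}$$
Thus the tuning term ηt perturbs (both in a positive and negative direction) the Laplace share
for each product one at a time, and $\delta^u_{jt}$, $\delta^\ell_{jt}$ are then formed by applying the logit inversion to this
perturbed share.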
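The logit bounds can be computed in a few lines once the Laplace shares are in hand. The sketch below is illustrative: the function name `logit_bounds`, the argument `eta_scale` (the fraction of the upper limit 1/(n + J + 1) used for the perturbation), and the toy counts are assumptions of the example.

```python
import numpy as np

def logit_bounds(counts, eta_scale=1.0 - 1e-3):
    """Lower and upper mean-utility bounds in the simple logit case.

    counts[0] is the outside-option count, counts[1:] the inside-product counts.
    eta_scale in (0, 1) sets eta = eta_scale / (n + J + 1).
    """
    counts = np.asarray(counts, dtype=float)
    n, J = counts.sum(), counts.size - 1
    s = (counts + 1.0) / (n + J + 1.0)     # Laplace shares, outside option first
    eta = eta_scale / (n + J + 1.0)        # strictly below the smallest possible share
    s0, sj = s[0], s[1:]
    delta_u = np.log((sj + eta) / (s0 - eta))
    delta_l = np.log((sj - eta) / (s0 + eta))
    return delta_l, delta_u

# One market with 1000 consumers: a top seller, a sparse seller, and a zero seller.
delta_l, delta_u = logit_bounds([600, 399, 1, 0])
width = delta_u - delta_l
print(width)
```

The interval is a small fraction of a unit wide for the top seller and several units wide for the product with no sales, so the bounds are nearly an equality where demand is thick and close to uninformative where it is thin.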
Remark 1. Observe that the distance between $\delta^u_{jt}$ and $\delta^\ell_{jt}$ is large when sjt is small (i.e., respecting
the large error caused by the noise in st) and is negligible when sjt is large. Thus intuitively, for
observations such that zjt ∈ Z0, which defines the safe products in a market, sjt is large with a
high probability so that $E[\delta^u_{jt} \mid z_{jt}]$ and $E[\delta^\ell_{jt} \mid z_{jt}]$ closely resemble E [δjt| zjt]. On the other hand,
the difference between $E[\delta^u_{jt} \mid z_{jt}]$ and $E[\delta^\ell_{jt} \mid z_{jt}]$ may be large for risky products (i.e., zjt ∈ Z1)
because sjt has a high probability of being close to zero. This feature of the construction is a key to
the consistency result to be discussed later.
We now formally establish the validity of the bounds defined by (4.10) and (4.11).
Assumption 1. The conditional distribution of $(n_t s_{jt})_{j=0}^{J_t}$ given $(\pi_{jt}, x_{jt}, z_{jt})_{j=1}^{J_t}$ is multinomial
with parameters nt and $(\pi_{jt})_{j=0}^{J_t}$.
Assumption 2. The inverse demand function $\sigma_j^{-1}(\cdot, x_t, \lambda)$ is well-defined and continuous on the
probability simplex $\Delta_{J_t} \equiv \{(p_1, \dots, p_{J_t}) \in (0, 1)^{J_t} : 1 - \sum_{j=1}^{J_t} p_j > 0\}$ for any xt and any λ.
Lemma 1. Suppose that Assumptions 1 and 2 hold. Then, there exists ηt ∈ (0, 1/(nt + Jt + 1))
such that the inequalities in (4.8) hold at λ = λ0 with $\delta^u_{jt}(\lambda)$ and $\delta^\ell_{jt}(\lambda)$ defined in (4.10) and (4.11).
Remark 2. The scalar $\eta_t$ is chosen to guarantee (4.8). The $\eta_t$ satisfying (4.8) may depend on $\pi_t$, $x_t$, and $n_t$, and thus may itself be a random variable, which makes it appear difficult to choose. However, we find that a rule of thumb works very well in both our Monte Carlo and empirical exercises. The rule is to start with, for example, $\eta_t = \frac{1 - 10^{-3}}{n_t + J_t + 1}$, then increase it to $\eta_t = \frac{1 - 10^{-4}}{n_t + J_t + 1}$, then to $\eta_t = \frac{1 - 10^{-5}}{n_t + J_t + 1}$, and so on, until the estimates stabilize. To see why this rule of thumb is reasonable, it is useful to note that if one choice, say $\eta^1_t$, satisfies (4.8), then another choice, say $\eta^2_t$, that lies between $\eta^1_t$ and $1/(n_t + J_t + 1)$ also satisfies (4.8). This is due to the monotonicity of the right-hand sides of (4.10) and (4.11) in $\eta_t$. On the other hand, using $\eta_t$'s that are closer to the boundary $1/(n_t + J_t + 1)$ will generally not hurt estimation precision much because identification is based on the safe products, for which even the upper bound $1/(n_t + J_t + 1)$ is negligible relative to $s_{jt}$ with high probability. This suggests that we do not need to know the precise range of $\eta_t$'s that work, but can afford to make a conservative choice, as our rule of thumb does.
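The rule of thumb in Remark 2 can be sketched as a simple loop. This is a hedged illustration rather than the authors' implementation: `estimate` is a hypothetical user-supplied function returning a scalar estimate for a given $\eta$, and the tolerance and the range of exponents are assumptions.

```python
def eta_schedule(n, J, start_exp=3, max_exp=8):
    """Yield eta = (1 - 10**-k) / (n + J + 1) for k = start_exp, ..., max_exp."""
    for k in range(start_exp, max_exp + 1):
        yield (1 - 10.0 ** (-k)) / (n + J + 1)

def stabilize(estimate, n, J, tol=1e-3):
    """Re-estimate along the schedule until successive estimates differ by < tol."""
    previous = None
    for eta in eta_schedule(n, J):
        current = estimate(eta)  # hypothetical estimator evaluated at this eta
        if previous is not None and abs(current - previous) < tol:
            return eta, current
        previous = current
    return eta, current  # schedule exhausted without meeting the tolerance
```

Every value in the schedule stays strictly below the boundary $1/(n_t + J_t + 1)$ and moves toward it, which matches the conservative direction the remark recommends.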
In order to estimate $\theta$ based on the moment inequalities (4.9), we first transform the conditional moment inequalities into unconditional ones, following Andrews and Shi (2013), using a set $G$ of instrumental functions, where an instrumental function is a function of $z_{jt}$. The set $G$ that we use is given below, and it guarantees that (4.9) is equivalent to
\[ E\left[(\delta^u_{jt}(\lambda) - x'_{jt}\beta)\, g(z_{jt})\right] \ge 0 \ge E\left[(\delta^\ell_{jt}(\lambda) - x'_{jt}\beta)\, g(z_{jt})\right]. \tag{4.14} \]
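Andrews and Shi (2013) discussed many different choices of $G$, including uncountable sets and countable sets. We only consider countable sets $G$. Thus, given a data set of $J$ products from $T$ markets, we can construct a sample criterion function as
\[ Q_T(\theta) = \sum_{g \in G} \left[\rho^u_T(\theta, g)\right]_-^2 \mu(g) + \sum_{g \in G} \left[\rho^\ell_T(\theta, g)\right]_-^2 \mu(g), \tag{4.15} \]
where $[a]_-$ denotes the negative part of $a$, so that only violated inequalities contribute, and
\[ \rho^u_T(\theta, g) = (T\bar J)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (\delta^u_{jt}(\lambda) - x'_{jt}\beta)\, g(z_{jt}), \]
\[ \rho^\ell_T(\theta, g) = (T\bar J)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (x'_{jt}\beta - \delta^\ell_{jt}(\lambda))\, g(z_{jt}), \tag{4.16} \]
where $\bar J = T^{-1} \sum_{t=1}^{T} J_t$ is the average number of products in a market. The function $\mu(\cdot)$ is a probability distribution on $G$, which gives a weight to each unconditional moment inequality. Our choice for $\mu(\cdot)$ is given below after the choice for $G$ is introduced.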
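To make the structure of the sample criterion in (4.15)-(4.16) concrete, here is a minimal sketch for a scalar regressor and a fixed $\lambda$. It is an illustration under assumptions: observations are pooled over $(j, t)$, so the normalization $T\bar J$ is just the number of observations, and the function and argument names are hypothetical.

```python
def neg_part_sq(a):
    """Squared negative part: zero when the inequality a >= 0 holds."""
    return min(a, 0.0) ** 2

def criterion(beta, delta_u, delta_l, x, z, instruments, weights):
    """Sample criterion for scalar x_jt, with observations pooled over (j, t).

    delta_u, delta_l: upper and lower bounds evaluated at a fixed lambda.
    instruments: list of functions g(z); weights: the corresponding mu(g).
    """
    n_obs = len(x)  # equals T times the average number of products
    q = 0.0
    for g, mu in zip(instruments, weights):
        rho_u = sum((du - beta * xi) * g(zi)
                    for du, xi, zi in zip(delta_u, x, z)) / n_obs
        rho_l = sum((beta * xi - dl) * g(zi)
                    for dl, xi, zi in zip(delta_l, x, z)) / n_obs
        q += (neg_part_sq(rho_u) + neg_part_sq(rho_l)) * mu
    return q
```

The criterion is zero at any $\beta$ for which every weighted sample inequality holds and grows quadratically in the size of each violation.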
Our bound estimator for $\theta = (\beta', \lambda')'$ is defined as$^{10}$
\[ \hat\theta^{BD}_T = \arg\min_{\theta} Q_T(\theta). \tag{4.17} \]
Numerically solving for $\hat\theta^{BD}_T$ is not much different from solving for the standard BLP estimator. As in the standard procedure, the criterion function is convex in $\beta$.$^{11}$ Thus, it is useful to separate the minimization problem into two steps:
\[ \min_{\lambda} \min_{\beta} Q_T(\beta, \lambda). \tag{4.18} \]
The β minimization can be solved efficiently and accurately even when many control variables are
included in xjt. The λ minimization typically is a low-dimensional problem. One point worth noting
is that the inverse demand functions involved in the quantities δujt(λ) and δ`jt(λ) can be solved by
the same contraction mapping algorithm used in the standard BLP procedure. Alternatively, the
optimization problem (4.18) can be formulated and solved as an MPEC problem using the machinery
of Dube, Fox, and Su (2012).
Now we define the collection of instrumental functions $G$ and the weight $\mu(\cdot)$ on it that we use in the simulation and the empirical application of this paper.$^{12}$ For $G$, we divide the instrument vector $z_{jt}$ into discrete instruments, $z_{d,jt}$, and continuous instruments, $z_{c,jt}$. Let the set $Z_d$ be the discrete set of values that $z_{d,jt}$ can take. Normalize the continuous instruments to lie in $[0, 1]$: $\tilde z_{c,jt} = F_{N(0,1)}(\hat\Sigma_{z_c}^{-1/2} z_{c,jt})$, where $F_{N(0,1)}(\cdot)$ is the standard normal cdf and $\hat\Sigma_{z_c}$ is the sample covariance matrix of $z_{c,jt}$. The set $G$ is defined as
\[ G = \{ g_{a,r,\zeta}(z_d, z_c) = 1((z'_c, z'_d)' \in C_{a,r,\zeta}) : C_{a,r,\zeta} \in \mathcal{C} \}, \text{ where} \]
\[ \mathcal{C} = \left\{ \left( \times_{u=1}^{d_{z_c}} \left( (a_u - 1)/(2r),\ a_u/(2r) \right] \right) \times \{\zeta\} : a_u \in \{1, 2, \dots, 2r\} \text{ for } u = 1, \dots, d_{z_c},\ r = r_0, r_0 + 1, \dots, \text{ and } \zeta \in Z_d \right\}. \tag{4.19} \]
In practice, we truncate $r$ at a finite value $r_T$. This does not affect the first order asymptotic property of our estimator as long as $r_T \to \infty$. For $\mu(\cdot)$, we use
\[ \mu(\{g_{a,r,\zeta}\}) \propto (100 + r)^{-2} (2r)^{-d_{z_c}} K_d^{-1} \text{ for } g \in G_{d,cc}, \tag{4.20} \]
where $K_d$ is the number of elements in $Z_d$. The same $\mu$ measure is used and works well in Andrews and Shi (2013).
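The hypercube instruments in (4.19) and the weights in (4.20) can be enumerated directly once $r$ is truncated. The sketch below is illustrative only: it assumes the continuous instruments have already been normalized to $[0, 1]$, the function name and the small default `r_max` are hypothetical, and the weights are rescaled to sum to one over the truncated collection.

```python
from itertools import product

def cube_instruments(d_zc, discrete_values, r0=1, r_max=2):
    """Enumerate (g, mu) pairs for the hypercube class in (4.19)-(4.20).

    Each g maps (zc, zd) to 0/1, where zc is a tuple already normalized to [0, 1].
    """
    K_d = len(discrete_values)
    pairs = []
    for r in range(r0, r_max + 1):
        mu = (100 + r) ** -2 * (2 * r) ** -d_zc / K_d  # unnormalized weight
        for a in product(range(1, 2 * r + 1), repeat=d_zc):
            for zeta in discrete_values:
                # Bind a, r, zeta as defaults so each closure keeps its own cube.
                def g(zc, zd, a=a, r=r, zeta=zeta):
                    in_cube = all((au - 1) / (2 * r) < v <= au / (2 * r)
                                  for au, v in zip(a, zc))
                    return 1.0 if in_cube and zd == zeta else 0.0
                pairs.append((g, mu))
    total = sum(mu for _, mu in pairs)
    return [(g, mu / total) for g, mu in pairs]

# One continuous and one binary instrument, with r = 1 and r = 2.
pairs = cube_instruments(d_zc=1, discrete_values=[0, 1])
```

For each $r$, the cubes partition $(0, 1]^{d_{z_c}} \times Z_d$, so a given observation activates exactly one instrument per value of $r$.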
10 When there is no partition of the space of $z_{jt}$ that separates out the safe products, the moment inequalities (4.9) partially identify $\theta$. In that case, the confidence set procedure in Andrews and Shi (2013), as well as the profiling approach in an early version of this paper, may be used for inference. However, in the current version of this paper, we focus on the point identification case, which is much more computationally tractable.
11 The convexity can be seen by examining the second order derivative of $Q_T(\theta)$ with respect to $\beta$.
12 We note that appropriate choices of $G$ and $\mu$ are not unique. For other possible choices, see Andrews and Shi (2013).
4.2 Consistency and Asymptotic Normality
In the asymptotic framework, we let the number of markets T go to infinity, and let the number of
consumers in each market, nt, be a function of T that also goes to infinity as T does. The number
of products Jt may also be a function of T that goes to infinity with T, or it may stay finite.
The key concept behind our approach is the notion of safe products. We define the safe products
according to the value that zjt takes. Let Z0 be a subset of Rdz , where dz is the dimension of zjt.
The product j is said to be a safe product in market t if zjt ∈ Z0. Thus, the instrumental variable
not only induces exogenous variation of the explanatory variables as in the standard setup, but also
serves as an identifier of the safe products. The requirements on the set Z0 are listed below.
If $j$ is a safe product in market $t$, its market share $\pi_{jt}$ tends to be sufficiently different from zero, so that the slope of $\sigma_j^{-1}(\pi_t, x_t; \lambda)$ at the true choice probability $\pi_t$ tends not to be huge. As a result, the inverse demand function $\sigma_j^{-1}(\hat\pi_t, x_t; \lambda)$ should be sufficiently close to $\sigma_j^{-1}(\pi_t, x_t; \lambda)$ for a consistent estimator $\hat\pi_t$ of $\pi_t$. Thus, the first requirement is as follows.
Assumption 3. For any estimator $\hat\pi_t$ of $\pi_t$ such that $\sup_{j=0,\dots,J_t,\ t=1,\dots,T} |\hat\pi_{jt} - \pi_{jt}| \to_p 0$, we have
(a) $\sup_{t=1,\dots,T;\ j=1,\dots,J_t:\ z_{jt} \in Z_0} \sup_{\lambda} |\sigma_j^{-1}(\hat\pi_t, x_t; \lambda) - \sigma_j^{-1}(\pi_t, x_t; \lambda)| \to_p 0$.
(b) $\sup_{t=1,\dots,T;\ j=0,\dots,J_t:\ z_{jt} \in Z_0} |\ln \hat\pi_{jt} - \ln \pi_{jt}| \to_p 0$.
Remark 3. Assumption 3 is a strict generalization of the key consistency requirement for the
standard estimator as formalized by Assumption A.8 in Freyberger (2015) or Assumption A5 in
Berry, Linton, and Pakes (2004). Our Assumption 3 relaxes their approach by not placing any
restriction on the size of πjt for the risky products (zjt /∈ Z0). For those products, πjt can be
very small (asymptotically, it can approach zero very fast), and σ−1j (st, xt;λ) can be very different
from σ−1j (πt, xt;λ) (asymptotically, the two may not converge to each other), causing standard
estimators to fail.
In order to leverage the mass of safe product/market realizations in Z0 to achieve consistency,
we need to ensure that the variation in Z0 alone is enough to point identify θ0. This is the analogue
of the general identification condition (4.7) holding if we hypothetically selected the sample so that
zjt ∈ Z0.
Assumption 4. For any $\theta \neq \theta_0$, $\Pr_F(m^*_F(\theta, z_{jt}) \neq 0 \text{ and } z_{jt} \in Z_0) > 0$, where $m^*_F(\theta, z_{jt})$ is defined in (4.7).
Note that if the econometrician ex-ante knew Z0, then it would be straightforward to implement
the standard GMM estimator on the subsample Z0. However, we do not know Z0 ex-ante. The main idea
behind the design of our bound estimator is to automatically utilize the identification information
in Z0 while safely controlling for the presence of the risky mass Z1, without requiring the researcher
either to know or to estimate the partition ex-ante.
Assumption 5 below is a regularity condition that guarantees that the identification in the
assumption above can be achieved through the instrumental functions defined in (4.19) in the
fashion of Andrews and Shi (2013). Part (a) is a moment condition that is stronger than needed for
obtaining the Andrews and Shi (2013) type result, but the extra strength is used for the consistency
of θBDT later.
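Assumption 5. (a) $E_F\left[\sup_{\theta \in \Theta} |\sigma_j^{-1}(\pi_t, x_t; \lambda) - x'_{jt}\beta|\, 1\{z_{jt} \in Z_0\}\right] < \infty$.
(b) The set $Z_0$ is a countable disjoint union of elements in $\mathcal{C}$, where $\mathcal{C}$ is defined in (4.19).
Lemma 2. Under Assumptions 4 and 5, for any $\theta \neq \theta_0$, there exists a $g_{a,r,\zeta} \in G$ such that $C_{a,r,\zeta} \subseteq Z_0$ and
\[ E_F\left[(\sigma_j^{-1}(\pi_t, x_t; \lambda) - x'_{jt}\beta)\, g_{a,r,\zeta}(z_{jt})\right] \neq 0. \tag{4.21} \]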
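A few more standard assumptions are also needed for consistency. These are given next. Let
\[ \rho^*_F(\theta, g) = E_F\left[(\sigma_j^{-1}(\pi_t, x_t; \lambda) - x'_{jt}\beta)\, g(z_{jt})\right]. \tag{4.22} \]
Assumption 6. (a) At any point $\lambda$, $\sigma_j^{-1}(\pi_t, x_t; \lambda)$ is continuous in $\lambda$ with probability one.
(b) $\sup_{\theta \in \Theta} \left| (T\bar J)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} \left( (\sigma_j^{-1}(\pi_t, x_t; \lambda) - x'_{jt}\beta)\, g(z_{jt}) - \rho^*_F(\theta, g) \right) \right| \to_p 0$ for all $g \in G_0$.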
Assumption 7 (a) below is the same as the analogous condition in Freyberger (2015) and (b) is
weaker than requiring Jt to be bounded.
Assumption 7. (a) maxj,t |sjt − πjt| →p 0 as T →∞.
(b) mint=1,...,T nt → ∞ and maxt=1,...,T Jt/nt → 0.
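Define:
\[ \rho^u_{F,T}(\theta, g) = E_F\left[(\delta^u_{jt}(\lambda) - x'_{jt}\beta)\, g(z_{jt})\right], \]
\[ \rho^\ell_{F,T}(\theta, g) = E_F\left[(x'_{jt}\beta - \delta^\ell_{jt}(\lambda))\, g(z_{jt})\right]. \tag{4.23} \]
These functions have the $T$ subscript because the $\eta_t$ that enters $\delta^u_{jt}(\lambda)$ and $\delta^\ell_{jt}(\lambda)$ depends on $n_t$ and $J_t$, which depend on $T$. Assumption 8 is a uniform law of large numbers type requirement, which is implied by some mild moment existence conditions if the markets are independent of each other.
Assumption 8. For $k = u, \ell$, the functions $\rho^k_{F,T}$ are well defined, and
\[ \sup_{g \in G} \left| (T\bar J)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} \left( (\delta^k_{jt}(\lambda_0) - \beta'_0 x_{jt})\, g(z_{jt}) - \rho^k_{F,T}(\theta_0, g) \right) \right| \to_p 0. \]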
The following theorem shows the consistency of the bound estimator.
Theorem 1. Suppose that Assumptions 1-8 hold. Then
\[ \|\hat\theta^{BD}_T - \theta_0\| \to_p 0. \tag{4.24} \]
More assumptions are needed to derive the asymptotic normality of the bound estimator. These
conditions are technical rather than illuminating. Thus, we relegate them to Appendix C.1.
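Theorem 2. Suppose that Assumptions 1-8 and C.1-C.7 hold. Then
\[ \sqrt{T\bar J}\,(\hat\theta^{BD}_T - \theta_0) \to_d N(0, \Gamma V \Gamma), \]
where $\Gamma = \left[ \sum_{g \in G_0} \frac{\partial \rho^*_F(\theta_0, g)}{\partial \theta} \frac{\partial \rho^*_F(\theta_0, g)}{\partial \theta'} \mu(g) \right]^{-1}$, and
\[ V = \sum_{g, g^* \in G_0} \Sigma(g, g^*) \frac{\partial \rho^*_F(\theta_0, g)}{\partial \theta} \frac{\partial \rho^*_F(\theta_0, g^*)}{\partial \theta'} \mu(g) \mu(g^*), \tag{4.25} \]
with $G_0 = \{ g_{a,r,\zeta} \in G : \Pr((z'_c, z_d)' \in C_{a,r,\zeta}) = \Pr((z'_c, z_d)' \in C_{a,r,\zeta} \cap Z_0) \}$, and $\Sigma(g, g^*)$ being the limit of
\[ \mathrm{Cov}\left( (T\bar J)^{-1/2} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (\sigma_j^{-1}(\pi_t, x_t; \lambda) - \beta' x_{jt})\, g(z_{jt}),\ (T\bar J)^{-1/2} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (\sigma_j^{-1}(\pi_t, x_t; \lambda) - \beta' x_{jt})\, g^*(z_{jt}) \right). \]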
A standard bootstrap procedure can be used to estimate the standard deviation of the estimator
in practice and we shall discuss the implementation details of this procedure in the empirical section.
4.3 Partial Identification as an Alternative
The approach above provides a consistent point estimator based on an underlying set of moment
inequalities. Point estimation relies on Assumptions 3 and 4, which allow variation among
safe products to deliver consistency. This is natural in many applications where the long tail pattern is
present, and we illustrate the estimator's performance in the Monte Carlo simulations below. Nevertheless, in settings where
these assumptions are questionable, we can still use the underlying moment inequalities (4.14) as
a basis for partial identification and inference.
The model (4.14) is a moment inequality model with many moment conditions. One can use the
method developed in Andrews and Shi (2013) to construct a joint confidence set for the full vector
$\theta_0$. This confidence set is constructed by inverting an Anderson-Rubin test: $CS = \{\theta : T(\theta) \le c(\theta)\}$ for some test statistic $T(\theta)$ and critical value $c(\theta)$. Computing this set amounts to computing the 0-level set of the function $T(\theta) - c(\theta)$, where $c(\theta)$ typically is a simulated quantile and thus a non-smooth function of $\theta$. This is feasible if the dimension of $\theta_0$ is moderate, especially if one has access to parallel computing technology. If the dimension is high, however, the computational cost grows exponentially, and methods for this case have not been well developed.
On the other hand, in demand estimation, θ0 is high dimensional mainly because of many
control variables included in xjt. The coefficients of the control variables are nuisance parameters
that often are of no particular interest. The typical parameters of interest are the price coefficient
or the price elasticities, which are small dimensional. Based on this observation, we propose a
profiling method to profile out the nuisance parameters and only construct confidence sets for a
parameter of interest. Since this part of the discussion is rather technical and tangential to our
main contribution, we relegate it to Appendix D. Also, readers are referred to the early version of
this paper (Gandhi, Lu, and Shi (2013)) for Monte Carlo simulations and empirical results using
the profiling approach under partial identification.
5 Monte Carlo Simulations
In this section, we present two sets of Monte Carlo experiments with random coefficient logit models.
The first experiment investigates the performance of our approach with moderate fractions of zero
shares, which should cover most of the empirical scenarios. In the second experiment, we test
our estimator with a data generating process that produces extremely large fractions of zeros; the
purpose is to further illustrate the key idea of our estimator in exploiting the long tail pattern that
is naturally present in the data.
Both experiments use a random coefficient logit model, where the utility of consumer $i$ for product $j$ in market $t$ is
\[ u_{ijt} = \alpha_0 + x_{jt}\beta_0 + \lambda_0 x_{jt} v_i + \xi_{jt} + \varepsilon_{ijt}, \]
where $v_i \sim N(0, 1)$, $\lambda_0$ is the standard deviation of the random coefficient on $x_{jt}$, and the $\varepsilon_{ijt}$'s are i.i.d. across $i$, $j$ and $t$ following the Type I extreme value distribution. The parameters of interest are $\beta_0$ and $\lambda_0$, while $\alpha_0$ is a nuisance parameter. In both experiments, we set $\lambda_0 = .5$, $\beta_0 = 1$ and vary $\alpha_0$ for different designs. We simulate $T$ markets, each with $J$ products.
5.1 Moderately Many Zeroes
In the first experiment, the observed and unobserved characteristics are generated as $x_{jt} = \frac{j}{10} + N(0, 1)$ and $\xi_{jt} \sim N(0, .1^2)$ for each product $j$ in market $t$. Thus one feature of the design is that $x_{jt}$ has some persistence across markets: products with a larger index tend to have a higher value of $x$ (which respects the nature of the variation in the scanner data shown in Section 2). Finally, the vector of empirical shares in market $t$, $(s_{0t}, s_{1t}, \dots, s_{Jt})$, is generated from $\text{Multinomial}(n, [\pi_{0t}, \pi_{1t}, \dots, \pi_{Jt}]')/n$, where $n$ represents the number of consumers in each market.$^{13}$
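As a rough sketch of this design, the following simulates one market: characteristics, choice probabilities approximated by averaging logit probabilities over draws of $v_i$, and empirical shares from a multinomial draw. The function name, the seed handling and the use of independent categorical draws for the multinomial are assumptions made for illustration.

```python
import math
import random

def simulate_market(J, n, alpha0, beta0=1.0, lam0=0.5, n_draws=1000, rng=None):
    """Simulate one market; index 0 of pi and s is the outside good."""
    rng = rng or random.Random(0)
    x = [j / 10 + rng.gauss(0, 1) for j in range(1, J + 1)]
    xi = [rng.gauss(0, 0.1) for _ in range(J)]
    v = [rng.gauss(0, 1) for _ in range(n_draws)]
    pi = [0.0] * (J + 1)
    for vi in v:
        u = [math.exp(alpha0 + xj * beta0 + lam0 * xj * vi + e)
             for xj, e in zip(x, xi)]
        denom = 1.0 + sum(u)
        pi[0] += 1.0 / denom / n_draws
        for j, uj in enumerate(u, start=1):
            pi[j] += uj / denom / n_draws
    # Multinomial(n, pi) / n, built from n independent categorical draws.
    counts = [0] * (J + 1)
    for k in rng.choices(range(J + 1), weights=pi, k=n):
        counts[k] += 1
    s = [c / n for c in counts]
    return x, pi, s

x, pi, s = simulate_market(J=50, n=10000, alpha0=-9.0, rng=random.Random(1))
share_of_zeros = sum(1 for sj in s[1:] if sj == 0) / 50
```

Lowering `alpha0` shrinks all inside-good probabilities, which is how the designs generate progressively larger fractions of zero shares.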
With the simulated data set $\{(s_{jt}, x_{jt}) : j = 1, \dots, J\}_{t=1}^{T}$, we compute our bound estimator (bound), the standard BLP estimator using $s_t$ in place of $\pi_t$ and discarding observations with $s_{jt} = 0$ (ES), and the standard BLP estimator using the Laplace share $\tilde s_t$ (which has no zeros) in place of $\pi_t$ (LS).
All the estimators require simulating the market shares and solving demand systems for each trial value of $\lambda$ in optimizing the objective function for estimation. We use the same set of random draws of $v_i$ as in the data generating process to eliminate simulation error, as it is not the focus of this paper. The BLP contraction mapping method is employed to numerically solve the demand systems.
13 The $\pi_t$ has no closed form solution in the random coefficient model, and thus we compute it via simulation, i.e.,
\[ \pi_{jt} = \frac{1}{s} \sum_{i=1}^{s} \frac{\exp(\alpha_0 + x_{jt}\beta_0 + \lambda_0 x_{jt} v_i + \xi_{jt})}{1 + \sum_{k=1}^{J} \exp(\alpha_0 + x_{kt}\beta_0 + \lambda_0 x_{kt} v_i + \xi_{kt})}, \]
where $s = 1000$ is the number of consumer type draws ($v_i$).
We simulate 1000 datasets $\{(s^r_t, x^r_t) : t = 1, \dots, T\}_{r=1}^{1000}$ and implement all the estimators mentioned above on each for a repeated simulation study. For the instrumental functions, we use the countable hyper-cubes defined in (4.19), and set $r_T = 50$. We let $\eta = \frac{1 - \iota}{n + J + 1}$ with $\iota = 10^{-6}$ in constructing the bounds on the conditional expectation of the inverse demand function. Setting a smaller $\iota$, e.g., $10^{-10}$, gives virtually the same results as reported in the following tables. For the BLP estimator, we use $(1, x_{jt}, x_{jt}^2 - 1, x_{jt}^3 - 3x_{jt})$ (the first three Hermite polynomials) as instruments to construct the GMM objective function. Alternative transformations of $x_{jt}$ as instruments yield effectively the same results.
The bias and standard deviation of the estimators are presented in Table 2. As we can see from the table, the standard estimator with the empirical share $s_t$ shows large bias for both $\beta$ and $\lambda$. Replacing the empirical share $s_t$ with the Laplace share $\tilde s_t$ (and thus not discarding the observations with $s_{jt} = 0$) increases the bias for $\beta$ while reducing the bias for $\lambda$. Our bound estimator is the least biased, and its bias is very small for both parameters, especially when the sample size ($T$) is larger.
Table 2: Monte Carlo Results: Random-Coefficient Logit Model

                Ave. %              ES                  Bound                 LS
DGP    T     of Zeros            β        λ          β        λ          β        λ
I      25      9.53%   Bias   -.1936    .3706     -.0443    .0436     -.2380    .2938
                       SD      .0185    .0354      .0348    .0474      .0189    .0296
       50      9.46%   Bias   -.1940    .3717     -.0236    .0195     -.2353    .2916
                       SD      .0150    .0271      .0294    .0399      .0146    .0229
       100     9.48%   Bias   -.1939    .3706     -.0081    .0018     -.2347    .2901
                       SD      .0126    .0215      .0235    .0315      .0118    .0191
II     25     18.58%   Bias   -.6104    .6730     -.0329    .0169     -.4900    .3994
                       SD      .0664    .0841      .0534    .0525      .0319    .0388
       50     18.55%   Bias   -.6036    .6648     -.0040   -.0069     -.4867    .3970
                       SD      .0528    .0662      .0403    .0399      .0242    .0300
       100    18.53%   Bias   -.6018    .6613      .0037   -.0120     -.4865    .3960
                       SD      .0394    .0489      .0299    .0298      .0199    .0250
III    25     41.16%   Bias  -1.3199    .7299      .0253   -.0344    -1.0112    .3830
                       SD      .3056    .2201      .0725    .0487      .0564    .0476
       50     41.12%   Bias  -1.2937    .7099      .0263   -.0299    -1.0060    .3794
                       SD      .2003    .1418      .0550    .0375      .0430    .0367
       100    41.07%   Bias  -1.2903    .7051      .0112   -.0171    -1.0044    .3762
                       SD      .1435    .1028      .0394    .0282      .0342    .0282
IV     25     52.41%   Bias  -1.1039    .4041      .0453   -.0461    -1.1613    .2857
                       SD      .2467    .1381      .0939    .0549      .0551    .0416
       50     52.38%   Bias  -1.0969    .3973      .0260   -.0297    -1.1564    .2829
                       SD      .1804    .1017      .0665    .0415      .0422    .0318
       100    52.35%   Bias  -1.0901    .3922      .0104   -.0175    -1.1548    .2805
                       SD      .1335    .0761      .0493    .0327      .0335    .0246

Note: 1. J = 50, N = 10,000, β0 = 1, λ0 = .5, Number of Repetitions = 1000.
2. “ES”: Empirical Shares; “LS”: Laplace Shares.
3. DGP: I, II, III and IV correspond to α0 = −9, −10, −12 and −13, respectively.
5.2 Extremely Many Zeroes
Next we pressure test our bound estimator by pushing the fraction of zeroes in empirical shares toward the extreme. We modify the DGP slightly to produce a very high fraction of zeros. Specifically, we generate $x_{jt}$ from the following discrete distribution

x                 1      12     15
Pr(xjt = x)      .99    .005   .005

and
\[ \xi_{jt} \sim 1(x_{jt} = 1) \times N(0, 2^2) + 1(x_{jt} \neq 1) \times N(0, .1^2). \]
All the other aspects of the DGP are identical to the previous DGP.
The fractions of zeroes are made very high (82%-96%) by choosing the α0 parameter. With
such high fractions of zeroes, the vast majority of observations are uninformative. Thus, we need
larger sample size for any estimator to perform well. We consider T = 100, 200, 400. For simplicity
of presentation and to reduce computational burden, we will here fix λ at its true value, and only
investigate the behaviors of the estimators for β .
The results are reported in Table 3, and they are very encouraging for the bound approach. The
ES estimator is severely biased toward 0, and so is the LS estimator. The bound estimator is remarkably
accurate in these extreme cases. The performance highlights the key idea of identification behind
our estimator: utilizing the information in safe products with inherently thick demand to identify
the model while controlling the risky products with small/zero sales properly.
Table 3: Monte Carlo Results: Very Large Fraction of Zeros

                Ave. %                       β
DGP    T     of Zeros            ES       Bound       LS
I      100    82.91%   Bias   -.3222    -.0072    -.2643
                       SD      .0272     .0342     .0240
       200    82.92%   Bias   -.3219    -.0072    -.2633
                       SD      .0142     .0095     .0041
       400    82.94%   Bias   -.3194    -.0060    -.2633
                       SD      .0267     .0068     .0031
II     100    89.59%   Bias   -.3777    -.0059    -.3311
                       SD      .0129     .0133     .0063
       200    89.57%   Bias   -.3777    -.0066    -.3308
                       SD      .0125     .0095     .0045
       400    89.55%   Bias   -.3759    -.0060    -.3308
                       SD      .0230     .0066     .0033
III    100    96.35%   Bias   -.5613    -.0060    -.5499
                       SD      .0090     .0139     .0090
       200    96.36%   Bias   -.5615    -.0064    -.5498
                       SD      .0069     .0097     .0064
       400    96.35%   Bias   -.5605    -.0061    -.5495
                       SD      .0102     .0071     .0046

Note: 1. T = 100, J = 50, N = 10,000, β0 = 1, λ0 = .5, Number of Repetitions = 1000.
2. We fix λ = λ0 (at the true value) without estimating it.
3. DGP: I, II, III correspond to α0 = −13, −14, −17.
6 Empirical Application
In this section, we apply our estimator on the same DFF scanner data previewed in Section 2. In
particular, we focus on the canned tuna category, as previously studied by Chevalier, Kashyap, and
Rossi (2003) (CKR for short) and Nevo and Hatzitaskos (2006) (NH for short). CKR observed
using the DFF data discussed in Section 2 that the share weighted price of tuna fell by 15 percent
during Lent (which we replicate below in our sample from the same data source), which is a high
demand period for this product. They attributed the outcome to loss-leading behavior on the part
of retailers. NH on the other hand suggest that this pricing pattern in the tuna data could instead
be explained by increased price sensitivity of consumers (consistent with an increase in search)
which causes a re-allocation of market shares towards less expensive products in the Lent period,
and hence a fall in the observed share weighted price index. They test this hypothesis directly in
the data by estimating demand parameters separately in the Lent and Non-Lent periods, and find
that demand becomes more elastic in the high demand (Lent) period.
Here we revisit the groundwork laid by NH to examine the difference in price elasticity between
Lent and non-Lent periods. The main difference in our analysis is that we use data on all products in
the analysis, while NH restrict the sample to include only the top 30 UPCs and thus automatically
drop products with small/zero sales. The two main questions we seek to address are: a) Does
the selection of UPCs with only positive shares significantly bias the estimates of price elasticity?
and b) Does the difference in price elasticities between the Lent and Non-Lent periods persist after
properly controlling for zeroes?
To make the comparison clear, we use largely the same specification of the model used in NH.
In particular we consider a logit specification
uijt = αpjt + βxjt + ξjt + εijt,
where the control variables xjt consist of UPC fixed effects and a time trend.14 Thus the week
to week variation in the product-/market-level unobserved demand shock ξjt largely captures the
short-term promotional efforts, e.g., in-store advertising and shelving choices, because the UPC
fixed effects control the intrinsic product quality that is likely to be stable over short time horizon.
Because stores are likely to advertise or shelve the product in a more prominent way during weeks
when the product is on a price sale, we expect a negative correlation between price and the unobservable.
We construct instruments for price by inverting DFF’s data on gross margin to calculate
the chain’s wholesale costs, which is the standard price instrument in the literature that has studied
the DFF data.15
We implement our bound estimator defined by (4.17) to obtain point estimates of (α, β) in the
model. The 95% confidence intervals for the parameters are obtained using a standard bootstrap
procedure.16
14 Empirical market shares are constructed using quantity sales and the number of people who visited the store that week (the customer count) as the relevant market size.
15 The gross margin is defined as (retail price - wholesale cost)/retail price, so we get wholesale cost using retail price × (1 - gross margin). The instrument is defensible in the store-disaggregated context we consider here because it has been shown that price sales in retail price primarily reflect a reduction in retailer margins rather than a reduction in marginal costs (see, e.g., Chevalier, Kashyap, and Rossi (2003) and Hosken and Reiffen (2004)). Thus sales (and hence promotions) are not being driven by the manufacturer through temporary reductions in marginal costs.
16 The procedure contains the following steps: 1) draw with replacement a bootstrap sample of markets, denoted as $\{t_1, \dots, t_T\}$; 2) compute the bound estimator $\hat\theta^{BD*}_T$ using the bootstrap sample; 3) repeat 1)-2) $B_T$ times and obtain $B_T$ independent (conditional on the original sample) copies of $\hat\theta^{BD*}_T$; 4) let $q^*_T(\tau)$ be the $\tau$-th quantile of the $B_T$ copies of $(\hat\theta^{BD*}_T - \hat\theta^{BD}_T)$; then the 95% bootstrap confidence interval is $[\hat\theta^{BD}_T - q^*_T(.975),\ \hat\theta^{BD}_T - q^*_T(.025)]$.
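The four bootstrap steps in footnote 16 can be sketched generically for a scalar parameter. This is an illustrative outline, not the authors' code: `estimator` is a hypothetical function that maps a list of markets to a scalar estimate, and the nearest-rank quantile and the default `B` are assumptions.

```python
import random

def quantile(sorted_vals, tau):
    """Empirical quantile by nearest rank on an already sorted list."""
    idx = min(len(sorted_vals) - 1, max(0, int(tau * len(sorted_vals))))
    return sorted_vals[idx]

def bootstrap_ci(markets, estimator, B=200, level=0.95, rng=None):
    """Market-level bootstrap interval for a scalar estimator, steps 1)-4)."""
    rng = rng or random.Random(0)
    theta_hat = estimator(markets)
    T = len(markets)
    diffs = []
    for _ in range(B):
        sample = [markets[rng.randrange(T)] for _ in range(T)]  # step 1
        diffs.append(estimator(sample) - theta_hat)             # steps 2-3
    diffs.sort()
    tail = (1 - level) / 2
    # Step 4: subtract the upper quantile for the lower endpoint and vice versa.
    return theta_hat - quantile(diffs, 1 - tail), theta_hat - quantile(diffs, tail)
```

Resampling whole markets rather than individual product observations keeps the within-market dependence of shares intact in each bootstrap sample.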
The estimation results are presented in Tables 4 and 5.$^{17}$ Table 4 shows that the standard logit estimator that inverts empirical shares to recover mean utilities (and hence drops zeroes) has a significant selection bias towards zero. The UPC level elasticities for the logit model are small in economic magnitude, with the average elasticity in the data being -.572. Furthermore, over 90% of products have inelastic demand. Using our bounds approach instead to control for zeroes has a major effect on the estimated elasticities. The average demand elasticity for UPCs becomes -1.362 and less than 35% of observations have inelastic demand. The direction of this change in elasticities is consistent with the attenuation bias effects of dropping products with small/zero market shares.
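For reference, the link between the price coefficient and the own-price elasticity in a logit specification can be written out. The formula below is the standard textbook expression and is stated here as an assumption about how the reported elasticities are computed, since the text does not spell it out; $\pi_{jt}$ denotes the choice probability and $p_{jt}$ the price.

```latex
\[
  \frac{\partial \pi_{jt}}{\partial p_{jt}} \cdot \frac{p_{jt}}{\pi_{jt}}
  = \alpha \, p_{jt} \, (1 - \pi_{jt}).
\]
```

Under this expression the elasticity scales roughly in proportion to $\alpha$ when shares are small, which is in line with the reported numbers: the price coefficient ratio is $-.910 / -.390 \approx 2.33$ and the average elasticity ratio is $-1.362 / -.572 \approx 2.38$. The two ratios need not match exactly because the estimation samples differ.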
Table 4: Demand Estimation Results

                                    BLP              Bound
Price Coefficient                  -.390             -.910
95% CI                         [-.40, -.38]     [-1.06, -.81]
Ave. Own Price Elasticity          -.572            -1.362
Fraction of Inelastic Products    90.04%            33.79%
No. of Obs.                       862,683           959,331
Table 5: Demand in Lent vs. Non-Lent

                                           BLP                            Bound
                                    Lent         Non-Lent          Lent         Non-Lent
Price Coefficient                  -.518          -.371            -.743          -.911
95% CI                         [-.55, -.48]   [-.38, -.36]     [-.84, -.45]   [-1.01, -.65]
Ave. Own Price Elasticity          -.757          -.544            -1.09         -1.302
Fraction of Inelastic Products    84.02%         92.84%           43.65%         35.00%
No. of Obs.                       70,496         792,187          78,838         880,493
Our second result is that we do not find evidence to suggest demand is becoming more elastic in
the high demand period, as shown in Table 5. Using the standard logit estimator with zeroes being
dropped shows findings consistent with Nevo and Hatzitaskos (2006) - demand appears more elastic
in the high demand Lent period. On the contrary, this effect disappears, and marginally changes
signs, under our bounds estimator that controls for the zeroes. Thus we do not see evidence in our
estimation of price elasticity being higher during the high demand period.
This finding can be rationalized if the magnitude of the selection problem with dropping zeroes
is different across the two periods. Such a change in the distribution of the unobservable ξjt in the
Lent period is indeed consistent with several features of the data. To see this, let us first recall the main reduced form fact documented by Nevo and Hatzitaskos (2006) that suggested a change in price sensitivity in the Lent period. We replicate this reduced form finding in Table 6, which shows that although the price index of tuna during Lent appears to be approximately 15 percent less expensive than in other weeks (as previously underscored by CKR), the average price of tuna is virtually unchanged between the Lent and non-Lent periods. Hence it is a re-allocation of demand towards less expensive products during Lent that drives the change in the aggregate price index.
17 In principle we can estimate our model separately for each store, letting preferences change freely across stores depending on local preferences. These results are available upon request. Here we present the results of pooling all stores together, as was done by Nevo and Hatzitaskos (2006). The store level regression results are very similar to the pooled store regression, and the latter is a more concise summary of demand behavior.
Table 6: Regression of Price Index on Lent

              P                  P
        (Price Index)     (Average Price)
Lent       -.150              -.009
s.e.      (.0005)            (.0003)
We take this decomposition one step further than NH, and examine the price index separately for products “on sale” and “regularly priced” during these periods.$^{18}$ As can be seen in Table 7, it is the sales price index that is the key driver of the aggregate price index being cheaper during Lent. However, the average price of an “on-sale” product is not cheaper in the Lent period. This shows that it is a re-allocation towards more steeply discounted “on-sale” products during Lent that is driving this change in the aggregate price index. We do not see a corresponding reallocation for “regularly priced” products.
Table 7: Regression of Sales Price Index on Lent

                  P                        P
            (Price Index)           (Average Price)
          Sale       Regular       Sale       Regular
Lent     -.199        .035         .010        .001
s.e.    (.0017)     (.0003)      (.0016)     (.0003)
This suggests a tighter coordination of promotional effort and discounting in the high demand
period. In effect, more steeply discounted products receive larger promotional effort on the
part of the retailer during the high demand period, which is closer in spirit to the loss-leader hypothesis
originally advanced for this data by CKR. Because promotional effort in the model is largely
captured through the unobservable ξjt, this change in behavior of the unobservable would also
account for the selection effect due to dropping zeroes changing across the two periods. This
hypothesis is also consistent with our estimated model: the correlation between pjt and ξjt among
products that are flagged as being on sale (having at least a 5% reduction from highest price of
previous 3 weeks) increases from -.16 to -.24 between the Non-Lent and Lent periods.
18 We flag an observation in the data as being on sale if that particular UPC in that particular store in that particular week has at least a 5% reduction from the highest price of the previous 3 weeks.
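The on-sale flag described in footnote 18 can be sketched for a single UPC-store price series. The function name is hypothetical, and the treatment of the first weeks (flagged as not on sale when no earlier price is available) is an assumption the text does not address.

```python
def flag_on_sale(prices, window=3, cut=0.05):
    """Flag week w as on sale if its price is at least `cut` below the highest
    price over the previous `window` weeks (one UPC-store price series)."""
    flags = []
    for w, p in enumerate(prices):
        past = prices[max(0, w - window):w]  # up to `window` earlier weeks
        flags.append(bool(past) and p <= (1 - cut) * max(past))
    return flags

weekly_prices = [1.00, 1.00, 0.90, 1.00, 0.96]
flags = flag_on_sale(weekly_prices)
```

In this example only the third week is flagged: its price of 0.90 is at least 5% below the earlier high of 1.00, while 0.96 in the last week is not.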
7 Conclusion
We have shown that differentiated product demand models have enough content to construct a
system of moment inequalities that can be used to consistently estimate demand parameters despite
a possibly large presence of observations with zero market shares in the data. We construct a GMM-
type estimator based on these moment inequalities that is consistent and asymptotically normal
under assumptions that are a reasonable approximation to the DGP in many product differentiated
environments. Our application to scanner data reveals that taking the market zeroes in the data
into account has economically important implications for price elasticities.
A key message from our analysis is that it is critical not to ignore the zero shares when estimating
discrete choice models with disaggregated market data. A potentially fruitful area for future
research is the application of our approach to individual level choice data, such as a household panel.
Aggregating over households is still necessary to control for price endogeneity, as described
by Berry, Levinsohn, and Pakes (2004) and Goolsbee and Petrin (2004), and thus zero market
shares that arise when we aggregate over a limited sample of households in the data are a clear problem in
many contexts. Nevertheless the demographic richness in the household panel provides additional
identifying power for random coefficients. The approach we describe can offer a novel solution to
the joint problem of endogenous prices and flexible consumer heterogeneity with micro data, which
we plan to pursue in future work.
A Further Illustrations of Zipf’s Law
In Figure 3 we illustrate this regularity using data from the two other applications that were
mentioned in Section 2: homicide rates and international trade flows. The left hand graph shows
the annual murder rate (per 10,000 people) for each county in the US from 1977-1992 (for details
about the data see Dezhbakhsh, Rubin, and Shepherd (2003)). The right hand side graph shows the
import trade flows (measured in millions of US dollars) among 160 countries that have a regional
trade agreement in the year 2006 (for details about the data see Head, Mayer, et al. (2013)). In each
of these two cases we see the characteristic pattern of Zipf’s law - a sharp decay in the frequency
for large outcomes and a large mass near zero (with a mode at zero in each case).
Figure 3: Zipf’s Law in Crime and Trade Data
B Proofs of Lemma 1 and Theorem 1
B.1 Proof of Lemma 1
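Proof of Lemma 1. First consider the derivation:
\[
\begin{aligned}
E\left[\ln\left(\frac{\tilde{s}_{jt}+\eta}{\tilde{s}_{0t}-\eta}\right)\,\middle|\,\pi_t,x_t\right]
&= E\left[\ln\left(\frac{n_t s_{jt}+1}{n_t+J_t+1}+\eta\right)\,\middle|\,\pi_t,x_t\right]
 - E\left[\ln\left(\frac{n_t s_{0t}+1}{n_t+J_t+1}-\eta\right)\,\middle|\,\pi_t,x_t\right] \\
&\ge \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - E\left[\ln\left(\frac{n_t s_{0t}+1}{n_t+J_t+1}-\eta\right)\,\middle|\,\pi_t,x_t\right] \\
&\ge \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - \ln\left(\frac{n_t+1}{n_t+J_t+1}-\eta\right)\Pr(n_t s_{0t}\ge 1\mid\pi_t)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)\Pr(n_t s_{0t}=0\mid\pi_t) \\
&\ge \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - \ln\left(\frac{n_t+1}{n_t+J_t+1}-\eta\right)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)(1-\pi_{0t})^{n_t} \\
&\ge \ln\left(\frac{1+\eta(n_t+J_t+1)}{n_t+1+\eta(n_t+J_t+1)}\right)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)(1-\pi_{0t})^{n_t},
\end{aligned}
\tag{B.1}
\]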
where the first inequality holds because ntsjt ≥ 0, the second inequality holds because nts0t ≤ nt,
and the third inequality holds by Pr(nts0t ≥ 1|πt) ≤ 1 and Assumption 1. As η approaches 1/(nt+Jt+1)
from below, the right-hand side diverges to positive infinity. Therefore, for any finite (πt, xt)-measurable
quantity, there exists an ηt ∈ (0, 1/(nt + Jt + 1)) such that
$E[\ln((\tilde{s}_{jt}+\eta)/(\tilde{s}_{0t}-\eta)) \mid \pi_t, x_t]$ is
greater than this quantity when η = ηt.
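As a numerical illustration of the divergence claim, the following minimal Python sketch evaluates the last line of (B.1) as η approaches 1/(nt + Jt + 1) from below. The market size, number of products, outside-good probability, and the function names are hypothetical values chosen for illustration; the bound is parameterized by the gap 1/(nt + Jt + 1) − η to avoid floating-point cancellation.

```python
import math

def laplace_share(s, n, J):
    """Laplace share (n*s + 1) / (n + J + 1) built from an empirical share s."""
    return (n * s + 1.0) / (n + J + 1.0)

def b1_lower_bound(gap, n, J, pi0):
    """Last line of (B.1), written in terms of gap = 1/(n+J+1) - eta > 0."""
    m = n + J + 1.0
    if not 0.0 < gap < 1.0 / m:
        raise ValueError("gap must lie in (0, 1/(n+J+1))")
    eta = 1.0 / m - gap
    first = math.log((1.0 + eta * m) / (n + 1.0 + eta * m))
    second = math.log(gap) * (1.0 - pi0) ** n
    return first - second

# Hypothetical market: 5 consumers, 10 inside goods, outside-good probability 0.1.
n, J, pi0 = 5, 10, 0.1
gaps = [1e-3, 1e-30, 1e-100, 1e-300]
bounds = [b1_lower_bound(g, n, J, pi0) for g in gaps]
```

In this sketch the bound grows without limit as the gap shrinks, but only at a logarithmic rate, and the growth is scaled by (1 − π0t)^nt, so it is slow when the outside good is rarely unchosen.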
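Next, define the ε-shrinkage of the J-dimensional simplex to be
\[
\Delta^{\varepsilon}_{J} = \Big\{(p_1,\dots,p_J)\in(0,1)^J : p_j\ge\varepsilon,\ 1-\sum_{j=1}^{J}p_j\ge\varepsilon\Big\}.
\]
By the definition of the Laplace share, $\Delta_{jt}(\tilde{s}_t,x_t,\lambda_0)$ lies in the interval
\[
\left[\min_{\pi\in\Delta^{1/(n_t+J_t+1)}_{J_t}}\Delta_{jt}(\pi,x_t,\lambda_0),\ \max_{\pi\in\Delta^{1/(n_t+J_t+1)}_{J_t}}\Delta_{jt}(\pi,x_t,\lambda_0)\right]. \tag{B.2}
\]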
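The interval is well-defined and finite by Assumption 2. Similarly, δjt(λ0) is finite. Therefore, there
exists ηt such that
\[
\begin{aligned}
E\left[\ln\left(\frac{\tilde{s}_{jt}+\eta_t}{\tilde{s}_{0t}-\eta_t}\right)\,\middle|\,\pi_t,x_t\right]
&\ge -\min_{\pi\in\Delta^{1/(n_t+J_t+1)}_{J_t}}\Delta_{jt}(\pi,x_t,\lambda_0)+\delta_{jt}(\lambda_0) \\
&\ge -E[\Delta_{jt}(\tilde{s}_t,x_t,\lambda_0)\mid\pi_t,x_t]+\delta_{jt}(\lambda_0).
\end{aligned}
\tag{B.3}
\]
This shows that $E[\delta^u_{jt}(\lambda_0)\mid\pi_t,x_t]\ge\delta_{jt}(\lambda_0)$, which implies that
\[
E[\delta^u_{jt}(\lambda_0)\mid z_{jt}]\ge E[\delta_{jt}(\lambda_0)\mid z_{jt}]. \tag{B.4}
\]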
This proves the upper bound part of (4.8). The lower bound part is analogous and thus omitted.
B.2 Proof of Theorem 1
Next, we prove Theorem 1. To do so, we present three lemmas first. Proofs of these lemmas are
presented after that of Theorem 1. Consider the subset of the instrumental function collection:
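\[
\mathcal{G}_0=\{g_{a,r,\zeta}\in\mathcal{G} : \Pr((z_c',z_d)'\in C_{a,r,\zeta})=\Pr((z_c',z_d)'\in C_{a,r,\zeta}\cap\mathcal{Z}_0)\}. \tag{B.5}
\]
Let
\[
Q^*_0(\theta)=\sum_{g\in\mathcal{G}_0}(\rho^*_F(\theta,g))^2\mu(g). \tag{B.6}
\]
Lemma 3. Suppose that Assumptions 1-6 hold. Then for any c > 0,
\[
\inf_{\theta\in\Theta:\|\theta-\theta_0\|>c}Q^*_0(\theta)>0. \tag{B.7}
\]
Let
\[
Q_{0,T}(\theta)=\sum_{g\in\mathcal{G}_0}\left([\rho^u_T(\theta,g)]_-^2+[\rho^{\ell}_T(\theta,g)]_-^2\right)\mu(g).
\]
Lemma 4. Suppose that Assumptions 3-7 hold. Then, $\sup_{\theta\in\Theta}|Q_{0,T}(\theta)-Q^*_0(\theta)|\to_p 0$.
Lemma 5. Suppose that Assumption 8 holds. Then,
\[
Q_T(\theta_0)=o_p(1). \tag{B.8}
\]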
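To make the structure of the criterion function concrete, here is a minimal Python sketch that evaluates a sum of the form used for Q0,T(θ) at a fixed θ, given the upper and lower sample moments for a finite list of instrument functions. The function names and the numerical inputs are hypothetical; the negative-part convention [x]− = |min{x, 0}| is assumed.

```python
def neg_part(x):
    """Negative part [x]_- = |min(x, 0)|."""
    return abs(min(x, 0.0))

def criterion(rho_u, rho_l, mu):
    """Sum over g of ([rho_u(g)]_-^2 + [rho_l(g)]_-^2) * mu(g) at a fixed theta."""
    return sum((neg_part(u) ** 2 + neg_part(l) ** 2) * w
               for u, l, w in zip(rho_u, rho_l, mu))

# Hypothetical moments for two instrument functions with equal weights.
q_value = criterion([0.5, -0.2], [0.1, -0.1], [0.5, 0.5])
```

Only violated inequalities (negative moments) contribute, so the criterion is zero whenever both sets of sample moments are non-negative.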
Proof of Theorem 1. Consider an arbitrary c > 0. Let $q=\inf_{\theta\in\Theta:\|\theta-\theta_0\|>c}Q^*_0(\theta)$. Then q > 0. The
theorem is implied by the following derivation:
\[
\begin{aligned}
\Pr\left(\|\theta^{BD}_T-\theta_0\|>c\right)
&\le \Pr\left(Q^*_0(\theta^{BD}_T)\ge q\right) \\
&= \Pr\left(Q^*_0(\theta^{BD}_T)-Q_{0,T}(\theta^{BD}_T)+Q_{0,T}(\theta^{BD}_T)\ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^*_0(\theta)-Q_{0,T}(\theta)|+Q_{0,T}(\theta^{BD}_T)\ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^*_0(\theta)-Q_{0,T}(\theta)|+Q_T(\theta^{BD}_T)\ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^*_0(\theta)-Q_{0,T}(\theta)|+Q_T(\theta_0)\ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^*_0(\theta)-Q_{0,T}(\theta)|\ge q/2\right)+\Pr\left(Q_T(\theta_0)\ge q/2\right)\to 0,
\end{aligned}
\tag{B.9}
\]
where the first inequality holds by Lemma 3, the third inequality holds because QT(θ) differs from
Q0,T(θ) only in that the former takes the sum over a larger range and the common
summands are non-negative, the fourth inequality holds because $Q_T(\theta^{BD}_T)\le Q_T(\theta_0)$ by the
definition of $\theta^{BD}_T$, and the convergence holds by Lemmas 4 and 5.
Proof of Lemma 3. Consider a sequence $\{\theta_m\}_{m=1}^{\infty}$ such that
\[
\lim_{m\to\infty}Q^*_0(\theta_m)=\inf_{\theta\in\Theta:\|\theta-\theta_0\|>c}Q^*_0(\theta). \tag{B.10}
\]
Because Θ is compact, it is without loss of generality to assume that $\lim_{m\to\infty}\theta_m=\theta^*$ for some
$\theta^*\in\Theta$.
Because $\|\theta_m-\theta_0\|>c$ for all m, we have
\[
\|\theta^*-\theta_0\|\ge c. \tag{B.11}
\]
That is, $\theta^*\ne\theta_0$. By Lemma 2, there exists a $g^*\in\mathcal{G}_0$ such that $\rho^*_F(\theta^*,g^*)\ne 0$. Next, we show that
\[
\lim_{m\to\infty}\rho^*_F(\theta_m,g^*)=\rho^*_F(\theta^*,g^*). \tag{B.12}
\]
Once this is established, the result of the lemma is implied by
\[
\inf_{\theta\in\Theta:\|\theta-\theta_0\|>c}Q^*_0(\theta)=\lim_{m\to\infty}Q^*_0(\theta_m)\ge\lim_{m\to\infty}\mu(g^*)\rho^*_F(\theta_m,g^*)^2=\mu(g^*)\rho^*_F(\theta^*,g^*)^2>0. \tag{B.13}
\]
We show (B.12) using the dominated convergence theorem (DCT). By Assumption 6(a), we have
with probability one,
\[
(\sigma^{-1}_j(\pi_t,x_t,\lambda_m)-\beta_m'x_{jt})g^*(z_{jt})\to(\sigma^{-1}_j(\pi_t,x_t,\lambda^*)-\beta^{*\prime}x_{jt})g^*(z_{jt}). \tag{B.14}
\]
Also observe that $|(\sigma^{-1}_j(\pi_t,x_t,\lambda_m)-\beta_m'x_{jt})g^*(z_{jt})|\le\sup_{\theta\in\Theta}|\sigma^{-1}_j(\pi_t,x_t,\lambda)-\beta'x_{jt}|\,1\{z_{jt}\in\mathcal{Z}_0\}$.
The right-hand side is integrable by Assumption 5(a). Therefore, the DCT applies and yields
(B.12).
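Proof of Lemma 4. First, we show that
\[
\sup_{g\in\mathcal{G}_0}\sup_{\lambda}\left|(TJ)^{-1}\sum_{t=1}^{T}\sum_{j=1}^{J_t}\left(\delta^u_{jt}(\lambda)-\sigma^{-1}_j(\pi_t,x_t;\lambda)\right)g(z_{jt})\right|\to_p 0. \tag{B.15}
\]
To show this, observe that with probability one, the left-hand side is less than or equal to
\[
\sup_{t,j:\,z_{jt}\in\mathcal{Z}_0}\sup_{\lambda}\left|\sigma^{-1}_j(\tilde{s}_t,x_t;\lambda)-\sigma^{-1}_j(\pi_t,x_t;\lambda)\right|
+\sup_{t,j:\,z_{jt}\in\mathcal{Z}_0}\left|\ln\left(\frac{\tilde{s}_{jt}+\eta}{\tilde{s}_{0t}-\eta}\right)-\ln\left(\frac{\tilde{s}_{jt}}{\tilde{s}_{0t}}\right)\right|. \tag{B.16}
\]
The in-probability convergence of this to zero is implied by Assumption 3 and the fact that
η ≤ 1/(nt+Jt+1) → 0, as long as we can show that $\max_{j=0,\dots,J_t;\,t=1,\dots,T}|\tilde{s}_{jt}-\pi_{jt}|\to_p 0$. The latter
convergence is true by the following derivation:
\[
\begin{aligned}
\max_{j;t}|\tilde{s}_{jt}-\pi_{jt}|
&\le \max_{j;t}\left|s_{jt}-\pi_{jt}+(n_t+J_t+1)^{-1}-(J_t+1)s_{jt}/(n_t+J_t+1)\right| \\
&\le \max_{j;t}|s_{jt}-\pi_{jt}|+|n_t^{-1}|+|(J_t+1)/n_t| \\
&\to_p 0,
\end{aligned}
\tag{B.17}
\]
where the convergence holds by Assumption 7. Therefore, (B.15) is proved.
Equation (B.15) and Assumption 6(b) together show that
\[
\sup_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}|\rho^u_T(\theta,g)-\rho^*_F(\theta,g)|\to_p 0. \tag{B.18}
\]
Similarly,
\[
\sup_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|\rho^{\ell}_T(\theta,g)+\rho^*_F(\theta,g)\right|\to_p 0. \tag{B.19}
\]
Then we can show that
\[
\begin{aligned}
\sup_{\theta\in\Theta}|Q_{0,T}(\theta)-Q^*_0(\theta)|
&\le \sum_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|[\rho^u_T(\theta,g)]_-^2+[\rho^{\ell}_T(\theta,g)]_-^2-\rho^*_F(\theta,g)^2\right|\mu(g) \\
&\le \sum_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|[\rho^u_T(\theta,g)]_-^2-[\rho^*_F(\theta,g)]_-^2\right|\mu(g)
+\sum_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|[\rho^{\ell}_T(\theta,g)]_-^2-[-\rho^*_F(\theta,g)]_-^2\right|\mu(g) \\
&= o_p(1).
\end{aligned}
\tag{B.20}
\]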
This concludes the proof of the lemma.
Proof of Lemma 5. The lemma is immediately implied by Assumption 8 and equation (4.14).
C Assumptions and Proof of Asymptotic Normality
C.1 Additional Assumptions for Asymptotic Normality
We derive the asymptotic normality of our bound point estimator using similar techniques as Khan
and Tamer (2009).
Additional assumptions are needed. We divide the assumptions into two groups. The first
group, Assumptions C.1-C.3, is needed for deriving the convergence rate. On top of those,
Assumptions C.4-C.7 are needed for the asymptotic normality.
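Assumption C.1. For an ε > 0 and an open ball, Bε(θ0), of radius ε around θ0,
\[
\sup_{g\in\mathcal{G},\,\theta\in B_\varepsilon(\theta_0)}\left(\left|\rho^u_T(\theta,g)-\rho^u_{F,T}(\theta,g)\right|+\left|\rho^{\ell}_T(\theta,g)-\rho^{\ell}_{F,T}(\theta,g)\right|\right)=O_p((TJ)^{-1/2}).
\]
Assumption C.2. (a) $\sigma^{-1}_j(\pi_t,x_t;\lambda)$ is continuously differentiable in λ in $B_\varepsilon(\lambda_0)$, an open ball
around λ0, and $E\|x_{jt}\|<\infty$.
(b) $E\sup_{\lambda\in B_\varepsilon(\lambda_0)}\|\partial\sigma^{-1}_j(\pi_t,x_t;\lambda)/\partial\lambda\|<\infty$ and $E\|x_{jt}\|<\infty$.
(c) $\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta'}\mu(g)$ is positive definite.
Assumption C.3. For an ε > 0, we have
\[
\sup_{g\in\mathcal{G}_0,\,\theta\in B_\varepsilon(\theta_0)}\left(\left|\rho^u_{F,T}(\theta,g)-\rho^*_F(\theta,g)\right|+\left|\rho^{\ell}_{F,T}(\theta,g)+\rho^*_F(\theta,g)\right|\right)=O((TJ)^{-1/2}).
\]
Assumption C.4. (a) $(TJ)^{1/2}\rho^u_{F,T}(\theta_0,g)\to\infty$ and $(TJ)^{1/2}\rho^{\ell}_{F,T}(\theta_0,g)\to\infty$ for all $g\in\mathcal{G}\backslash\mathcal{G}_0$.
(b) $\{\rho^u_{F,T}(\theta,g): g\in\mathcal{G},\ T=1,2,3,\dots\}$ and $\{\rho^{\ell}_{F,T}(\theta,g): g\in\mathcal{G},\ T=1,2,3,\dots\}$ are equicontinuous
in θ at θ0.
(c) $\rho^u_{F,T}(\theta,g)$ and $\rho^{\ell}_{F,T}(\theta,g)$ are differentiable in θ for every $g\in\mathcal{G}$ and T.
Assumption C.5. (a) For any ε > 0, we have
\[
\sup_{\theta\in B_\varepsilon(\theta_0),\,g\in\mathcal{G}}\left\|\frac{\partial\rho^u_T(\theta,g)}{\partial\theta}-\frac{\partial\rho^u_{F,T}(\theta,g)}{\partial\theta}\right\|\to_p 0
\quad\text{and}\quad
\sup_{\theta\in B_\varepsilon(\theta_0),\,g\in\mathcal{G}}\left\|\frac{\partial\rho^{\ell}_T(\theta,g)}{\partial\theta}-\frac{\partial\rho^{\ell}_{F,T}(\theta,g)}{\partial\theta}\right\|\to_p 0.
\]
(b) $\{\partial\rho^u_{F,T}(\theta,g)/\partial\theta: g\in\mathcal{G},\ T=1,2,3,\dots\}$ and $\{\partial\rho^{\ell}_{F,T}(\theta,g)/\partial\theta: g\in\mathcal{G},\ T=1,2,3,\dots\}$ are equicontinuous
in θ at θ0.
Define the infeasible sample moment function:
\[
\rho^*_T(\theta,g)=(TJ)^{-1}\sum_{t=1}^{T}\sum_{j=1}^{J_t}\left(\sigma^{-1}_j(\pi_t,x_t;\lambda)-\beta'x_{jt}\right)g(z_{jt}). \tag{C.1}
\]
Assumption C.6. (a) $\sup_{\theta\in B_\varepsilon(\theta_0),\,g\in\mathcal{G}_0}\|\rho^u_T(\theta,g)-\rho^*_T(\theta,g)\|=o_p((TJ)^{-1/2})$, and
$\sup_{\theta\in B_\varepsilon(\theta_0),\,g\in\mathcal{G}_0}\|\rho^{\ell}_T(\theta,g)+\rho^*_T(\theta,g)\|=o_p((TJ)^{-1/2})$.
(b) $\sup_{\theta\in B_\varepsilon(\theta_0),\,g\in\mathcal{G}_0}\left\|\frac{\partial\rho^u_T(\theta,g)}{\partial\theta}-\frac{\partial\rho^*_T(\theta,g)}{\partial\theta}\right\|=o_p(1)$, and
$\sup_{\theta\in B_\varepsilon(\theta_0),\,g\in\mathcal{G}_0}\left\|\frac{\partial\rho^{\ell}_T(\theta,g)}{\partial\theta}+\frac{\partial\rho^*_T(\theta,g)}{\partial\theta}\right\|=o_p(1)$.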
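Assumption C.7. (a) $(TJ)^{1/2}\rho^*_T(\theta_0,\cdot)\to_d\nu_\Sigma(\cdot)$, where $\{\nu_\Sigma(g): g\in\mathcal{G}_0\}$ is a tight Gaussian process
with variance-covariance kernel $\{\Sigma(g,g^*): (g,g^*)\in\mathcal{G}_0^2\}$.
(b) $\sup_{g\in\mathcal{G}_0,\,\theta\in B_\varepsilon(\theta_0)}\left\|\frac{\partial\rho^*_T(\theta,g)}{\partial\theta}-\frac{\partial\rho^*_F(\theta,g)}{\partial\theta}\right\|=o_p(1)$.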
C.2 Proof of Asymptotic Normality
We now prove Theorem 2. The proof uses the following lemma. The proof of the lemma is given
after that of Theorem 2.
Lemma C.1. Suppose that Assumptions 1-8 and C.1-C.3 are satisfied. Then
‖θBDT − θ0‖ = Op((T J)−1/2).
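Proof of Theorem 2. The criterion function is differentiable. Thus, we have the first-order condition:
\[
0=\frac{\partial Q_T(\theta^{BD}_T)}{\partial\theta}
=2\sum_{g\in\mathcal{G}}[\rho^u_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g)
+2\sum_{g\in\mathcal{G}}[\rho^{\ell}_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^{\ell}_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g). \tag{C.2}
\]
Consider the derivation: for all $g\in\mathcal{G}\backslash\mathcal{G}_0$,
\[
\Pr(\rho^u_T(\theta^{BD}_T,g)<0)=\Pr\left((TJ)^{1/2}\left(\rho^u_T(\theta^{BD}_T,g)-\rho^u_{F,T}(\theta_0,g)\right)<-(TJ)^{1/2}\rho^u_{F,T}(\theta_0,g)\right)\to 0, \tag{C.3}
\]
where the convergence holds by Assumptions C.1 and C.4(a). Similarly, we have, for every $g\in\mathcal{G}\backslash\mathcal{G}_0$,
\[
\Pr(\rho^{\ell}_T(\theta^{BD}_T,g)<0)\to 0. \tag{C.4}
\]
Thus, for every $g\in\mathcal{G}\backslash\mathcal{G}_0$,
\[
\Pr\left([\rho^u_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}=0\right)\to 1
\quad\text{and}\quad
\Pr\left([\rho^{\ell}_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^{\ell}_T(\theta^{BD}_T,g)}{\partial\theta}=0\right)\to 1. \tag{C.5}
\]
Because $\mathcal{G}\backslash\mathcal{G}_0$ is a countable set, the above convergence implies that, for any subsequence of $\{T\}_{T=1}^{\infty}$,
there exists a further subsequence $\{a_T\}_{T=1}^{\infty}$ such that
\[
(a_TJ_{a_T})^{1/2}[\rho^u_{a_T}(\theta^{BD}_{a_T},g)]_-\frac{\partial\rho^u_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}\to 0
\quad\text{and}\quad
(a_TJ_{a_T})^{1/2}[\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)]_-\frac{\partial\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}\to 0, \tag{C.6}
\]
almost surely for every $g\in\mathcal{G}\backslash\mathcal{G}_0$, where $J_{a_T}=\sum_{t=1}^{a_T}J_t$. By the bounded convergence theorem
(applied sample path by sample path), we have
\[
\sum_{g\in\mathcal{G}\backslash\mathcal{G}_0}\mu(g)\left((a_TJ_{a_T})^{1/2}[\rho^u_{a_T}(\theta^{BD}_{a_T},g)]_-\frac{\partial\rho^u_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}
+(a_TJ_{a_T})^{1/2}[\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)]_-\frac{\partial\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}\right)\to 0 \tag{C.7}
\]
almost surely. Thus,
\[
\sum_{g\in\mathcal{G}\backslash\mathcal{G}_0}\mu(g)\left((TJ)^{1/2}[\rho^u_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}
+(TJ)^{1/2}[\rho^{\ell}_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^{\ell}_T(\theta^{BD}_T,g)}{\partial\theta}\right)\to_p 0. \tag{C.8}
\]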
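This implies that
\[
\frac{\partial Q_T(\theta^{BD}_T)}{\partial\theta}
=o_p((TJ)^{-1/2})+2\sum_{g\in\mathcal{G}_0}\mu(g)\left([\rho^u_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}
+[\rho^{\ell}_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^{\ell}_T(\theta^{BD}_T,g)}{\partial\theta}\right). \tag{C.9}
\]
Next consider the following derivation, in which $\tilde{\theta}_T$ denotes a point on the line segment joining $\theta^{BD}_T$ and θ0:
\[
\begin{aligned}
&\sum_{g\in\mathcal{G}_0}[\rho^u_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g)
-\sum_{g\in\mathcal{G}_0}[\rho^*_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g) \\
&\quad=\sum_{g\in\mathcal{G}_0}\left([\rho^u_T(\theta^{BD}_T,g)]_--[\rho^*_T(\theta^{BD}_T,g)]_-\right)\left(\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}-\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\right)\mu(g) \\
&\qquad+\sum_{g\in\mathcal{G}_0}\left([\rho^u_T(\theta^{BD}_T,g)]_--[\rho^*_T(\theta^{BD}_T,g)]_-\right)\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g) \\
&\qquad+\sum_{g\in\mathcal{G}_0}\left[\rho^*_T(\theta_0,g)+\frac{\partial\rho^*_T(\tilde{\theta}_T,g)}{\partial\theta}(\theta^{BD}_T-\theta_0)\right]_-\left(\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}-\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\right)\mu(g).
\end{aligned}
\tag{C.10}
\]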
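The first summand on the right-hand side is $o_p((TJ)^{-1/2})$ by Assumption C.6. The second
summand is $o_p((TJ)^{-1/2})$ by Assumption C.6(a), and
\[
\sup_{g\in\mathcal{G}_0}\left\|\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}-\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\right\|=o_p(1), \tag{C.11}
\]
which holds by Assumptions C.5(a)-(b), C.6(b), and C.7(b). The third summand is $o_p((TJ)^{-1/2})$
by Assumption C.6(a), C.7(a), Lemma C.1 and
\[
\sup_{g\in\mathcal{G}_0}\left\|\frac{\partial\rho^*_T(\tilde{\theta}_T,g)}{\partial\theta}-\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\right\|=o_p(1), \tag{C.12}
\]
which holds similarly to (C.11). Therefore,
\[
\sum_{g\in\mathcal{G}_0}[\rho^u_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^u_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g)
=o_p((TJ)^{-1/2})+\sum_{g\in\mathcal{G}_0}[\rho^*_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g). \tag{C.13}
\]
Similarly, we can show that
\[
\sum_{g\in\mathcal{G}_0}[\rho^{\ell}_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^{\ell}_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g)
=o_p((TJ)^{-1/2})-\sum_{g\in\mathcal{G}_0}[-\rho^*_T(\theta^{BD}_T,g)]_-\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g). \tag{C.14}
\]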
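Equations (C.2), (C.9), (C.13), and (C.14) together show that
\[
0=\frac{\partial Q_T(\theta^{BD}_T)}{\partial\theta}=o_p((TJ)^{-1/2})+2\sum_{g\in\mathcal{G}_0}\rho^*_T(\theta^{BD}_T,g)\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g). \tag{C.15}
\]
Apply a mean-value expansion of $\rho^*_T(\theta^{BD}_T,g)$ around θ0, with $\tilde{\theta}_T$ a point on the line segment joining $\theta^{BD}_T$ and θ0, and we get
\[
o_p((TJ)^{-1/2})=\sum_{g\in\mathcal{G}_0}\rho^*_T(\theta_0,g)\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g)
+\left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_T(\tilde{\theta}_T,g)}{\partial\theta}\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta'}\mu(g)\right](\theta^{BD}_T-\theta_0). \tag{C.16}
\]
Therefore,
\[
\begin{aligned}
(TJ)^{1/2}(\theta^{BD}_T-\theta_0)
&=o_p(1)+\left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_T(\tilde{\theta}_T,g)}{\partial\theta}\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta'}\mu(g)\right]^{-1}\sum_{g\in\mathcal{G}_0}\rho^*_T(\theta_0,g)\frac{\partial\rho^*_T(\theta^{BD}_T,g)}{\partial\theta}\mu(g) \\
&\to_d\left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta'}\mu(g)\right]^{-1}\sum_{g\in\mathcal{G}_0}\nu_\Sigma(g)\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\mu(g) \\
&=_d N(0,\Gamma V\Gamma),
\end{aligned}
\tag{C.17}
\]
where $\Gamma=\left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta'}\mu(g)\right]^{-1}$, and
\[
V=\sum_{g,g^*\in\mathcal{G}_0}\Sigma(g,g^*)\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\frac{\partial\rho^*_F(\theta_0,g^*)}{\partial\theta'}\mu(g)\mu(g^*). \tag{C.18}
\]
This concludes the proof of the theorem.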
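The sandwich form ΓVΓ in (C.17)-(C.18) can be illustrated with a minimal Python sketch for the special case of a scalar θ and a finite instrument set G0. The function name, the derivative values, the weights, and the covariance kernel below are hypothetical inputs chosen for illustration, not estimates from the paper.

```python
def sandwich_variance(deriv, mu, kernel):
    """Gamma * V * Gamma for a scalar theta and a finite instrument set.

    deriv[a]     : derivative of rho*_F(theta0, g_a) with respect to theta
    mu[a]        : weight mu(g_a)
    kernel[a][b] : covariance kernel Sigma(g_a, g_b)
    """
    n = len(deriv)
    gamma = 1.0 / sum(d * d * w for d, w in zip(deriv, mu))
    v = sum(kernel[a][b] * deriv[a] * deriv[b] * mu[a] * mu[b]
            for a in range(n) for b in range(n))
    return gamma * v * gamma

# Two hypothetical instrument functions with uncorrelated unit-variance moments.
avar = sandwich_variance([1.0, 2.0], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

With a vector θ the same formula applies with matrix products and a matrix inverse in place of the scalar reciprocal.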
Proof of Lemma C.1. Below, we show the following results:
Q0,T (θBDT ) = Op((T J)−1), (C.19)
Q0(θBDT ) ≥ c‖θT − θ0‖2, (C.20)
Q0,T (θBDT )−Q0(θBDT ) = Op((T J)−1) +Op((T J)−1/2)‖θT − θ0‖, (C.21)
for some c > 0.
Equations (C.19)-(C.21) together imply that
c‖θT − θ0‖2 +Op(T−1/2)‖θT − θ0‖ = Op((T J)−1). (C.22)
This implies that
(c1/2‖θT − θ0‖+Op(T−1/2))2 = Op((T J)−1), (C.23)
which then implies the conclusion of the lemma.
Now we show (C.19). Observe that Q0,T (θBDT ) ≤ QT (θBDT ) ≤ QT (θ0). Thus, it suffices to show
that
QT (θ0) = Op((T J)−1). (C.24)
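Consider the derivation:
\[
\begin{aligned}
Q_T(\theta_0)
&=\sum_{g\in\mathcal{G}}[\rho^u_T(\theta_0,g)-\rho^u_{F,T}(\theta_0,g)+\rho^u_{F,T}(\theta_0,g)]_-^2\mu(g)
+\sum_{g\in\mathcal{G}}[\rho^{\ell}_T(\theta_0,g)-\rho^{\ell}_{F,T}(\theta_0,g)+\rho^{\ell}_{F,T}(\theta_0,g)]_-^2\mu(g) \\
&\le\sum_{g\in\mathcal{G}}[\rho^u_T(\theta_0,g)-\rho^u_{F,T}(\theta_0,g)]_-^2\mu(g)
+\sum_{g\in\mathcal{G}}[\rho^{\ell}_T(\theta_0,g)-\rho^{\ell}_{F,T}(\theta_0,g)]_-^2\mu(g) \\
&=O_p((TJ)^{-1}),
\end{aligned}
\tag{C.25}
\]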
where the inequality holds because ρ`F,T (θ0, g) ≥ 0, and ρuF,T (θ0, g) ≥ 0 for all g ∈ G by definition,
and the second equality holds by Assumption C.1. This shows (C.19).
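Next, we show (C.20). Consider the derivation:
\[
\begin{aligned}
Q_0(\theta^{BD}_T)
&=\sum_{g\in\mathcal{G}_0}(\theta^{BD}_T-\theta_0)'\frac{\partial\rho^*_F(\tilde{\theta}_T,g)}{\partial\theta}\frac{\partial\rho^*_F(\tilde{\theta}_T,g)}{\partial\theta'}(\theta^{BD}_T-\theta_0)\mu(g) \\
&=(\theta^{BD}_T-\theta_0)'\left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta'}\mu(g)+o_p(1)\right](\theta^{BD}_T-\theta_0) \\
&\ge c\|\theta^{BD}_T-\theta_0\|^2,
\end{aligned}
\tag{C.26}
\]
where the first equality holds by a mean-value expansion with $\tilde{\theta}_T$ being a point on the line segment
joining $\theta^{BD}_T$ and θ0, the second equality holds by Assumption C.2(a), and the inequality holds by
Assumption C.2(b) where c is the smallest eigenvalue of $\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta}\frac{\partial\rho^*_F(\theta_0,g)}{\partial\theta'}\mu(g)/2$. This shows
(C.20).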
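Finally, we show (C.21). Observe that:
\[
Q_{0,T}(\theta^{BD}_T)-Q_0(\theta^{BD}_T)
=\sum_{g\in\mathcal{G}_0}\left([\rho^u_T(\theta^{BD}_T,g)]_-^2-[\rho^*_F(\theta^{BD}_T,g)]_-^2\right)\mu(g)
+\sum_{g\in\mathcal{G}_0}\left([\rho^{\ell}_T(\theta^{BD}_T,g)]_-^2-[-\rho^*_F(\theta^{BD}_T,g)]_-^2\right)\mu(g). \tag{C.27}
\]
Consider the derivation regarding the first summand on the right-hand side of the equation above:
\[
\begin{aligned}
&\sum_{g\in\mathcal{G}_0}\left([\rho^u_T(\theta^{BD}_T,g)]_-^2-[\rho^*_F(\theta^{BD}_T,g)]_-^2\right)\mu(g) \\
&\quad=\sum_{g\in\mathcal{G}_0}\left\{\left([\rho^u_T(\theta^{BD}_T,g)]_--[\rho^*_F(\theta^{BD}_T,g)]_-\right)^2
+2[\rho^*_F(\theta^{BD}_T,g)]_-\left([\rho^u_T(\theta^{BD}_T,g)]_--[\rho^*_F(\theta^{BD}_T,g)]_-\right)\right\}\mu(g) \\
&\quad\le\sum_{g\in\mathcal{G}_0}\left(\rho^u_T(\theta^{BD}_T,g)-\rho^*_F(\theta^{BD}_T,g)\right)^2\mu(g)
+2\left(\sum_{g\in\mathcal{G}_0}\rho^*_F(\theta^{BD}_T,g)^2\mu(g)\right)^{1/2}\left(\sum_{g\in\mathcal{G}_0}\left(\rho^u_T(\theta^{BD}_T,g)-\rho^*_F(\theta^{BD}_T,g)\right)^2\mu(g)\right)^{1/2} \\
&\quad=O_p((TJ)^{-1})+2\,O_p((TJ)^{-1/2})\left(\sum_{g\in\mathcal{G}_0}\rho^*_F(\theta^{BD}_T,g)^2\mu(g)\right)^{1/2} \\
&\quad=O_p((TJ)^{-1})+O_p((TJ)^{-1/2})\left((\theta^{BD}_T-\theta_0)'\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^*_F(\tilde{\theta}_T,g)}{\partial\theta}\frac{\partial\rho^*_F(\tilde{\theta}_T,g)}{\partial\theta'}\mu(g)(\theta^{BD}_T-\theta_0)\right)^{1/2} \\
&\quad=O_p((TJ)^{-1})+O_p((TJ)^{-1/2})\|\theta^{BD}_T-\theta_0\|,
\end{aligned}
\tag{C.28}
\]
where the first equality holds by rearranging terms, the inequality holds by the fact that $|[a]_--[b]_-|\le|a-b|$
for any a, b ∈ R, and by the Cauchy-Schwarz inequality, the second equality holds
by Assumptions C.1 and C.3 and Theorem 1, the third equality holds with $\tilde{\theta}_T$ being a point on
the line segment joining $\theta^{BD}_T$ and θ0 by a mean-value expansion, and the last equality holds by
Assumption C.2 and Theorem 1. Similarly, we can show
\[
\sum_{g\in\mathcal{G}_0}\left([\rho^{\ell}_T(\theta^{BD}_T,g)]_-^2-[-\rho^*_F(\theta^{BD}_T,g)]_-^2\right)\mu(g)=O_p((TJ)^{-1})+O_p((TJ)^{-1/2})\|\theta^{BD}_T-\theta_0\|. \tag{C.29}
\]
Therefore, (C.21) is shown, and this concludes the proof of the lemma.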
References
Anderson, C. (2006): The Long Tail: Why the Future of Business Is Selling Less of More.
Hyperion.
Andrews, D. W. K., and X. Shi (2013): “Inference Based on Conditional Moment Inequality
Models,” Econometrica, 81.
Berry, S. (1994): “Estimating discrete-choice models of product differentiation,” The RAND
Journal of Economics, pp. 242–262.
Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile prices in market equilibrium,”
Econometrica: Journal of the Econometric Society, pp. 841–890.
Berry, S., J. Levinsohn, and A. Pakes (2004): “Differentiated Products Demand Systems
from a Combination of Micro and Macro Data: The New Vehicle Market,” Journal of Political
Economy, 112, 68–104.
Berry, S., O. Linton, and A. Pakes (2004): “Limit theorems for estimating the parameters of
differentiated product demand systems,” Review of Economic Studies, 71(3), 613–654.
Berry, S. T., and P. A. Haile (2014): “Identification in differentiated products markets using
market level data,” Econometrica, 82(5), 1749–1797.
Chevalier, J. A., A. K. Kashyap, and P. E. Rossi (2003): “Why Don’t Prices Rise During
Periods of Peak Demand? Evidence from Scanner Data,” American Economic Review, 93(1),
15–37.
Dezhbakhsh, H., P. H. Rubin, and J. M. Shepherd (2003): “Does capital punishment have a
deterrent effect? New evidence from postmoratorium panel data,” American Law and Economics
Review, 5(2), 344–376.
Dube, J.-P., J. T. Fox, and C.-L. Su (2012): “Improving the numerical performance of static
and dynamic aggregate discrete choice random coefficients demand estimation,” Econometrica,
80(5), 2231–2267.
Freyberger, J. (2015): “Asymptotic theory for differentiated products demand models with
many markets,” Journal of Econometrics, 185(1), 162–181.
Gabaix, X. (1999a): “Zipf’s Law and the Growth of Cities,” The American Economic Review,
Papers and Proceedings, 89, 129–132.
Gandhi, A., Z. Lu, and X. Shi (2013): “Estimating Demand for Differentiated Products with
Error in Market Shares,” CeMMAP working paper.
Goolsbee, A., and A. Petrin (2004): “The consumer gains from direct broadcast satellites and
the competition with cable TV,” Econometrica, 72(2), 351–381.
Head, K., T. Mayer, et al. (2013): “Gravity equations: Workhorse, toolkit, and cookbook,”
Handbook of international economics, 4.
Hosken, D., and D. Reiffen (2004): “Patterns of retail price variation,” RAND Journal of
Economics, pp. 128–146.
Jaynes, E. T. (2003): Probability Theory: The Logic of Science. Cambridge University Press, 1st
edn.
Khan, S., and E. Tamer (2009): “Inference on Randomly Censored Regression Models Using
Conditional Moment Inequalities,” Journal of Econometrics, 152, 104–119.
Nevo, A., and K. Hatzitaskos (2006): “Why does the average price paid fall during high demand
periods?,” Discussion paper, CSIO working paper.
Nurski, L., and F. Verboven (2016): “Exclusive Dealing as a Barrier to Entry? Evidence from
Automobiles,” The Review of Economic Studies, 83(3), 1156.
Quan, T. W., and K. R. Williams (2015): “Product Variety, Across-market Demand Hetero-
geneity, And The Value Of Online Retail,” Working Paper.
Online Appendix to “Estimating Demand for Differentiated Products with Zeroes in Market Share
Data”
In this online appendix, we introduce the profiling approach for models defined by many moment
inequalities. The profiling approach developed here is similar to the penalized resampling approach
in Bugni, Canay, and Shi (2016) for unconditional moment inequality models. Section D describes
the profiling approach and gives the formal results, and Section E presents the proofs of those
results.
D The Profiling Approach
The profiling approach applies to general moment inequality models with many moment inequali-
ties. Thus from this point on, we focus on the moment inequality model:
Eρ(wt, θ, g) ≥ 0 for all g ∈ G, (D.1)
where ρ takes values in $\mathbb{R}^k$. We also let G be a general set of indices that can be either countable
or uncountable. Let µ : G → [0, 1] denote a probability density on G. We assume the data $\{w_t\}_{t=1}^{T}$ are
i.i.d. across t.
We assume that there is a parameter of interest, γ0, that is related to θ0 through:
γ0 ∈ Γ(θ0) ⊆ Rdγ , (D.2)
where $\Gamma:\Theta\to 2^{\mathbb{R}^{d_\gamma}}$ is a known mapping and $2^{\mathbb{R}^{d_\gamma}}$ denotes the collection of all subsets of $\mathbb{R}^{d_\gamma}$.
Three examples of Γ are given below:
Example. Γ(θ) = α: γ0 is the price coefficient α0. In the simple logit model, the price coefficient
is all one needs to know to compute the demand elasticity.
Example. Γ(θ) = ej(p, π, θ, x) = (αpj/πj)(∂σj(σ−1(π, x, λ), x, λ)/∂δj): γ0 is the own-price de-
mand elasticity of product j at a given value of the price vector p, the choice probability vector π
and the covariates x.
Example. Γ(θ) = {ej(p, π, θ, x) : π ∈ [πl, πu]}: γ0 is the demand elasticity of product j at a given
value of the price vector p, the covariates x and at the choice probability vector that is known to lie
between πl and πu. This example is particularly useful when the elasticity depends on the choice
probability but the choice probability is only known to lie in an interval.
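Let Γ0 be the identified set of γ0: $\Gamma_0=\{\gamma\in\mathbb{R}^{d_\gamma}:\exists\,\theta\in\Theta_0\ \text{s.t.}\ \Gamma(\theta)\ni\gamma\}$, where
$\Theta_0=\{\theta\in\Theta: E\rho(w_t,\theta,g)\ge 0\ \forall g\in\mathcal{G}\}$. The profiling approach constructs a confidence set for γ0 by inverting a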
test of the hypothesis:
H0 : γ ∈ Γ0, (D.3)
for each parameter value γ. The confidence set is the collection of values that are not rejected by
the test.
Let $\Gamma^{-1}(\gamma)=\{\theta\in\Theta:\Gamma(\theta)\ni\gamma\}$. The test to be inverted uses the profiled test statistic:
\[
T_T(\gamma)=T\times\min_{\theta\in\Gamma^{-1}(\gamma)}Q_T(\theta), \tag{D.4}
\]
where QT(θ) is an empirical measure of the violation to the moment inequalities. The confidence
set of confidence level p is the set of all points for which the test statistic does not exceed a critical
value cT(γ, p):
\[
CS_T=\{\gamma\in\mathbb{R}^{d_\gamma}: T_T(\gamma)\le c_T(\gamma,p)\}. \tag{D.5}
\]
Notice that the new confidence set only involves computing a dγ-dimensional level set, where dγ is
often 1. The profiling transfers the burden of searching (for low values) over the surface of the non
smooth function T (θ)− c(θ) to searching over the surface of the typically smooth and often convex
function QT (θ).
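The test-inversion step in (D.5) can be sketched in a few lines of Python for a scalar γ evaluated on a grid. The grid, the toy profiled statistic, and the constant critical value below are hypothetical stand-ins; in practice the profiled statistic would come from the minimization in (D.4) and the critical value from a resampling procedure.

```python
def confidence_set(gamma_grid, profiled_stat, critical_value):
    """Keep every grid point whose profiled statistic does not exceed its critical value."""
    return [g for g in gamma_grid if profiled_stat(g) <= critical_value(g)]

T = 100  # hypothetical number of markets

def toy_stat(g):
    # Toy profiled statistic: zero on [0.5, 1.5], growing quadratically outside.
    return T * max(abs(g - 1.0) - 0.5, 0.0) ** 2

def toy_cv(g):
    # Toy critical value, constant in gamma.
    return 1.0

cs = confidence_set([0.0, 0.3, 0.5, 1.0, 1.55, 1.7], toy_stat, toy_cv)
```

Because only a one-dimensional grid is searched, the cost of the inversion is driven by the inner minimization over θ at each grid point rather than by the dimension of θ.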
We choose a critical value, cT(γ, p), of significance level 1 − p ∈ (0, 0.5), to satisfy
\[
\limsup_{T\to\infty}\sup_{(\gamma,F)\in\mathcal{H}_0}\Pr{}_F(T_T(\gamma)>c_T(\gamma,p))\le 1-p, \tag{D.6}
\]
where F is the distribution on $(w_t)_{t=1}^{T}$ and H0 is the null parameter space of (γ, F). The definition
of H0 along with other technical assumptions are given in Section D.4.¹⁹
As a result of (D.6), the confidence set asymptotically has the correct minimum coverage prob-
ability:
\[
\liminf_{T\to\infty}\inf_{(\gamma,F)\in\mathcal{H}_0}\Pr{}_F(\gamma\in CS_T)\ge p. \tag{D.7}
\]
The left hand side is called the “asymptotic size” of the confidence set in Andrews and Shi (2013).
We achieve the asymptotic size control by deriving an asymptotic approximation for the distribution
of the profiled test statistic TT (γ) that is uniformly valid over (γ, F ) ∈ H0 and simulating the
critical value from the approximating distribution through either a subsampling or a bootstrapping
procedure.
In the next subsections, we describe the test statistic and the critical value in detail and show
that (D.7) holds.
¹⁹ Note that we use F to denote the distribution of the full observed data vector and thus (γ, F) captures everything
unknown in the expression PrF(TT(γ) > cT(γ, p)). This notation differs from the traditional literature where the true
distribution of the data is often indicated by the true value of θ, but is standard in the recent partial identification
literature. See Romano and Shaikh (2008) and Andrews and Shi (2013).
D.1 Test Statistic
The test statistic is the QLR statistic (i.e. a criterion-function-based statistic)²⁰
\[
T_T(\gamma)=T\times\min_{\theta\in\Gamma^{-1}(\gamma)}Q_T(\theta)\quad\text{with}\quad
Q_T(\theta)=\int_{\mathcal{G}_T}S(\rho_T(\theta,g),\Sigma^{\iota}_T(\theta,g))\,d\mu(g), \tag{D.8}
\]
where GT is a truncated/simulated version of G such that GT ↑ G as T → ∞, µ(·) is a probability
measure on G, S(m,Σ) is a real-valued function that measures the discrepancy of m from the
inequality restriction m ≥ 0, and
\[
\begin{aligned}
\rho_T(\theta,g)&=T^{-1}\sum_{t=1}^{T}\rho(w_t,\theta,g), \\
\Sigma^{\iota}_T(\theta,g)&=\Sigma_T(\theta,g)+\iota\times\Sigma_T(\theta,1), \\
\Sigma_T(\theta,g)&=T^{-1}\sum_{t=1}^{T}\rho(w_t,\theta,g)\rho(w_t,\theta,g)'-\rho_T(\theta,g)\rho_T(\theta,g)'.
\end{aligned}
\tag{D.9}
\]
In the above definition, ι is a small positive number. It is used because in some forms of S defined
in Section D.4 the inverses of the diagonal elements of $\Sigma^{\iota}_T(\theta,g)$ enter, and ι prevents us from
inverting zeros. In some other forms of S, e.g. the one defined below and used in the simulation
and empirical sections of this paper, ι does not enter the test statistic because S(m,Σ) does not
depend on Σ.
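For a scalar moment function (k = 1), the quantities in (D.9) reduce to a sample mean and a sample variance, which the following minimal Python sketch computes. The function name, the input values, and the choice ι = 0.05 are assumptions made for illustration; `rho_one` stands for the moment function evaluated at the constant instrument g = 1.

```python
def sample_moments(rho_g, rho_one, iota=0.05):
    """Return (rho_T, Sigma_T, Sigma_T^iota) from (D.9) for a scalar moment.

    rho_g   : values rho(w_t, theta, g) across markets t = 1..T
    rho_one : values rho(w_t, theta, 1) for the constant instrument
    iota    : small positive regularization constant (hypothetical default)
    """
    T = len(rho_g)
    mean_g = sum(rho_g) / T
    var_g = sum(r * r for r in rho_g) / T - mean_g ** 2
    mean_1 = sum(rho_one) / T
    var_1 = sum(r * r for r in rho_one) / T - mean_1 ** 2
    return mean_g, var_g, var_g + iota * var_1

rho_bar, sigma, sigma_iota = sample_moments([1.0, 3.0], [0.0, 2.0])
```

The regularized variance stays strictly positive whenever the constant-instrument moment has positive sample variance, even if the moment for a particular g is degenerate in the sample.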
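Section D.4 gives the assumptions that the user-chosen quantities S, µ, G and GT should
satisfy. Under those assumptions, we can show that $\min_{\theta\in\Gamma^{-1}(\gamma)}Q_T(\theta)$ consistently estimates
$\min_{\theta\in\Gamma^{-1}(\gamma)}Q_F(\theta)$, where
\[
\begin{aligned}
Q_F(\theta)&=\int_{\mathcal{G}}S(\rho_F(\theta,g),\Sigma^{\iota}_F(\theta,g))\,d\mu(g),\quad\text{with} \\
\rho_F(\theta,g)&=E_F(\rho(w_t,\theta,g)), \\
\Sigma_F(\theta,g)&=\mathrm{Cov}_F(\rho(w_t,\theta,g))\quad\text{and}\quad\Sigma^{\iota}_F(\theta,g)=\Sigma_F(\theta,g)+\iota\,\Sigma_F(\theta,1).
\end{aligned}
\tag{D.10}
\]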
The symbols “EF ” and “CovF ” denote expectation and covariance under the data distribution F
respectively. Notice that Γ0 depends on F . We make this explicit by changing the notation Γ0 to
Γ0,F for the rest of this paper.
We can also show that $\min_{\theta\in\Gamma^{-1}(\gamma)}Q_F(\theta)=0$ if and only if γ ∈ Γ0,F. This result combined
with the consistency of $\min_{\theta\in\Gamma^{-1}(\gamma)}Q_T(\theta)$ implies that TT(γ) diverges to infinity at γ ∉ Γ0,F. That
implies that there is no information loss in using such a test statistic. Lemma D.1 summarizes those
two results. The parameter space H of (γ, F) appearing in the lemma is defined in Assumption
D.2 in Section D.4.
²⁰ Note that we do not follow the traditional QLR test exactly, which would define
$T_T(\gamma)=T\times\min_{\theta\in\Gamma^{-1}(\gamma)}Q_T(\theta)-T\times\min_{\theta\in\Theta}Q_T(\theta)$. This is because the validity of our critical value depends on
a certain monotonicity of the asymptotic approximation of the test statistic, and the monotonicity does not hold with this
alternative test statistic due to the subtraction of $T\times\min_{\theta\in\Theta}Q_T(\theta)$.
Lemma D.1. Suppose that Assumptions D.1, D.2, D.4, D.5(a), and D.6(a) and (d) hold. Then for
any (γ, F ) ∈ H,
(a) minθ∈Γ−1(γ) QT (θ)→p minθ∈Γ−1(γ)QF (θ) under F , and
(b) minθ∈Γ−1(γ)QF (θ) ≥ 0 and = 0 if and only if γ ∈ Γ0,F .
In the simulation and the empirical application of this paper, the following choices of S, G, GT and µ
are used mainly for computational convenience. For G, we use the one defined in (4.19). For GT ,
the truncated version of G, we define it to be the same as G except that we let r run from r0 to rT
where rT →∞ as T →∞ in the definition.
For S, we use
\[
S(m,\Sigma)=\sum_{j=1}^{k}[m_j]_-^2, \tag{D.11}
\]
where mj is the jth coordinate of m and [x]− = |min{x, 0}|. There may be efficiency loss from not
weighting the moments using the variance matrix, but this S function brings great computational
convenience because it makes the minimization problem in (D.4) a convex one. For µ(·), we use
\[
\mu(g_{a,r,\zeta})\propto(100+r)^{-2}(2r)^{-d_{z_c}}K_d^{-1}\quad\text{for } g\in\mathcal{G}_{d,cc}, \tag{D.12}
\]
where Kd is the number of elements in Zd. The same µ measure is used and seems to work well in
Andrews and Shi (2013).
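The two user-chosen quantities in (D.11) and (D.12) are simple to code. The minimal Python sketch below implements the S function and the unnormalized weight attached to a single instrument function at resolution r; the function names and the example arguments are hypothetical, and normalizing the weights to sum to one over the truncated set GT is left to the caller.

```python
def S(m):
    """S(m, Sigma) from (D.11): sum of squared negative parts; Sigma is not used."""
    return sum(abs(min(mj, 0.0)) ** 2 for mj in m)

def mu_weight(r, d_zc, K_d):
    """Unnormalized weight (100 + r)^(-2) * (2r)^(-d_zc) / K_d from (D.12)."""
    return (100 + r) ** -2 * (2 * r) ** (-d_zc) / K_d

penalty = S([1.0, -2.0, -0.5])               # only the two negative coordinates count
w_coarse = mu_weight(1, d_zc=1, K_d=2)       # coarsest cubes
w_fine = mu_weight(2, d_zc=1, K_d=2)         # finer cubes receive less weight each
```

Since each resolution level r contains on the order of (2r)^dzc × Kd instrument functions, the total weight placed on level r is proportional to (100 + r)^−2, which is summable over r.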
D.2 Critical Value
We propose two types of critical values, one based on standard subsampling and the other based
on a bootstrapping procedure with moment shrinking. Both are simple to compute. The bootstrap
critical value may have better small sample properties, and is the procedure we use in the empirical
section.²¹ It is worth noting that we resample at the market level for both the subsampling and
the bootstrap.
²¹ The bootstrap procedure here, like in most problems with partial identification, does not lead to higher-order
improvement.
Let us formally define the subsampling critical value first. It is obtained through the standard
subsampling steps: [1] from {1, ..., T}, draw without replacement a subsample of market indices
of size bT; [2] compute $T_{T,b_T}(\gamma)$ in the same way as TT(γ) except using the subsample of markets
corresponding to the indices drawn in [1] rather than the original sample; [3] repeat [1]-[2] ST times
and obtain ST independent (conditional on the original sample) copies of $T_{T,b_T}(\gamma)$; [4] let $c^*_{sub}(\gamma,p)$ be
the p quantile of the ST independent copies. Let the subsampling critical value be
\[
c^{sub}_T(\gamma,p)=c^*_{sub}(\gamma,p+\eta^*)+\eta^*, \tag{D.13}
\]
where η∗ > 0 is an infinitesimal number. The infinitesimal number is used to avoid making hard-to-verify
uniform continuity and strict monotonicity assumptions on the distribution of the test
statistic. It can be set to zero if one is willing to make the continuity assumptions. Such infinitesimal
numbers are also employed in Andrews and Shi (2013). One can follow their suggestion of using
$\eta^*=10^{-6}$.
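Steps [1]-[4] and (D.13) can be sketched as follows in Python. The callable `stat`, which is assumed to compute the profiled statistic on a given subsample of markets, as well as the default number of replications and the seed, are hypothetical; the quantile convention (smallest order statistic whose empirical CDF reaches the requested level) is one common choice and is an assumption here.

```python
import math
import random

def quantile(values, p):
    """Smallest order statistic whose empirical CDF is at least p."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered)))
    return ordered[k - 1]

def subsampling_critical_value(markets, stat, b, p, n_rep=200, eta_star=1e-6, seed=0):
    """Subsampling critical value in the spirit of (D.13).

    markets : list of market-level observations (resampling unit)
    stat    : callable mapping a subsample of markets to the profiled statistic
    b       : subsample size b_T
    """
    rng = random.Random(seed)
    # random.sample draws without replacement, matching step [1].
    draws = [stat(rng.sample(markets, b)) for _ in range(n_rep)]
    return quantile(draws, min(p + eta_star, 1.0)) + eta_star

# Toy usage with a statistic that only depends on the subsample size.
cv = subsampling_critical_value(list(range(10)), lambda sub: float(len(sub)), b=4, p=0.95)
```

In an application, `stat` would rerun the inner minimization over θ on each subsample, so the number of replications directly scales the computational cost.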
Let us now define the bootstrap critical value. It is obtained through the following steps: [1]
from the original sample {1, ..., T}, draw with replacement a bootstrap sample of size T; denote
the bootstrap sample by {t1, ..., tT}; [2] let the bootstrap statistic be
\[
T^*_T(\gamma)=\min_{\theta\in\Theta:\gamma\in\Gamma(\theta)}\int_{\mathcal{G}}S\left(\nu^*_T(\theta,g)+\kappa_T^{1/2}\rho_T(\theta,g),\Sigma^{\iota}_T(\theta,g)\right)d\mu(g), \tag{D.14}
\]
where $\nu^*_T(\theta,g)=\sqrt{T}(\rho^*_T(\theta,g)-\rho_T(\theta,g))$, $\rho^*_T(\theta,g)=T^{-1}\sum_{\tau=1}^{T}\rho(w_{t_\tau},\theta,g)$, and κT is a sequence
of moment shrinking parameters: $\kappa_T/T+\kappa_T^{-1}\to 0$; [3] repeat [1]-[2] ST times and obtain ST
independent (conditional on the original sample) copies of $T^*_T(\gamma)$; [4] let $c^*_{bt}(\gamma,p)$ be the p quantile
of the ST copies. Let the bootstrap critical value be
\[
c^{bt}_T(\gamma,p)=c^*_{bt}(\gamma,p+\eta^*)+\eta^*, \tag{D.15}
\]
where η∗ > 0 is an infinitesimal number which has the same function as in the subsampling critical
value above.
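The bootstrap steps can be illustrated with a deliberately simplified Python sketch: a single scalar moment, a single instrument function, and a fixed θ, so that the minimization over θ and the integral over g in (D.14) both drop out. The function name, the defaults, and the choice κT = log T in the usage line are assumptions for illustration (log T satisfies κT/T + 1/κT → 0); the S function is the one in (D.11).

```python
import math
import random

def bootstrap_critical_value(rho, p, kappa, n_rep=500, eta_star=1e-6, seed=0):
    """Simplified bootstrap critical value in the spirit of (D.14)-(D.15).

    rho   : market-level values rho(w_t, theta, g) at a fixed theta and g
    kappa : moment shrinking parameter kappa_T
    """
    rng = random.Random(seed)
    T = len(rho)
    rho_bar = sum(rho) / T
    stats = []
    for _ in range(n_rep):
        # Step [1]: resample markets with replacement.
        boot = [rho[rng.randrange(T)] for _ in range(T)]
        # Step [2]: recentered bootstrap process plus the shrunken sample moment.
        nu = math.sqrt(T) * (sum(boot) / T - rho_bar)
        stats.append(abs(min(nu + math.sqrt(kappa) * rho_bar, 0.0)) ** 2)
    ordered = sorted(stats)
    k = max(1, math.ceil(min(p + eta_star, 1.0) * n_rep))
    return ordered[k - 1] + eta_star

# Toy usage: a moment that is identically 1 in every market is never violated.
cv_slack = bootstrap_critical_value([1.0] * 20, 0.95, math.log(20))
cv_binding = bootstrap_critical_value([-1.0, 1.0] * 10, 0.95, math.log(20))
```

The shrinking term pushes clearly slack moments away from zero so that they stop contributing to the bootstrap statistic, while moments that are close to binding are left essentially untouched.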
Critical values that are not based on resampling are possible, too. For example, one can define
a critical value similar to the bootstrap one, except with $\nu^*_T(\theta,g)$ replaced by a Gaussian process
with covariance kernel that equals the sample covariance of $\rho(w_t,\theta^{(1)},g^{(1)})$ and $\rho(w_t,\theta^{(2)},g^{(2)})$ for
$(\theta^{(j)},g^{(j)})\in\Theta\times\mathcal{G}$, j = 1, 2. For lack of space, we do not discuss such critical values in detail.
D.3 Coverage Probability
We show that the confidence sets defined in (D.5) using either $c^{sub}_T(\gamma,p)$ or $c^{bt}_T(\gamma,p)$ have asymptotically
correct coverage probability uniformly over H0 under appropriate assumptions. The assumptions
are given in Section D.4.
Theorem D.1. Suppose that Assumptions D.1-D.3 and D.5-D.7 hold, then
(a) (D.7) holds with cT (γ, p) = csubT (γ, p), and
(b) (D.7) holds with cT (γ, p) = cbtT (γ, p).
D.4 Assumptions
In this section, we list all the technical assumptions required for the profiling approach. The
assumptions are grouped into seven categories. Assumption D.1 restricts the space of θ; Assumption
D.2 restricts the space of (γ, F), i.e. the parameters that determine the true data generating
process. Assumption D.3 further restricts the space of (γ, F) to satisfy the null hypothesis γ ∈ Γ0.
Assumption D.4 is the full support condition on the measure µ on G. Assumption D.5 regulates how
GT approaches G as T increases. Assumption D.6 restricts the function S(m,Σ) to satisfy certain
continuity, monotonicity and convexity conditions. Assumption D.7 regulates the subsample size
bT and the moment shrinking parameter κT in the bootstrap procedure. Throughout, we let $E^*$
and $E_*$ denote outer and inner expectations respectively and $\Pr^*$ and $\Pr_*$ denote outer and inner
probabilities.
Assumption D.1. (a) Θ is compact, (b) Γ is upper hemi-continuous, and (c) Γ−1(γ) is either
convex or empty for any γ ∈ Rdγ .
To introduce Assumption D.2 we need the following extra notation. Let $\{\nu_F(\theta,g):(\theta,g)\in\Theta\times\mathcal{G}\}$
denote a tight Gaussian process with covariance kernel
\[
\Sigma_F(\theta^{(1)},g^{(1)},\theta^{(2)},g^{(2)})=\mathrm{Cov}_F\left(\rho(w_t,\theta^{(1)},g^{(1)}),\rho(w_t,\theta^{(2)},g^{(2)})\right). \tag{D.16}
\]
Notice that ΣF (θ, g) = ΣF (θ, g, θ, g).
Let the derivative of ρF (θ, g) with respect to θ be GF (θ, g).
For any $\gamma\in\mathbb{R}^{d_\gamma}$, let the set Θ0,F(γ) be
\[
\Theta_{0,F}(\gamma)=\{\theta\in\Theta: Q_F(\theta)=0\ \&\ \Gamma(\theta)\ni\gamma\}. \tag{D.17}
\]
We call Θ0,F(γ) the zero-set of QF(θ) under (γ, F). Note that for any $\gamma\in\mathbb{R}^{d_\gamma}$, γ ∈ Γ0,F if and
only if Θ0,F(γ) ≠ ∅.
Let the distance from a point to a set be the usual mapping:
\[
d(a,A)=\inf_{a^*\in A}\|a-a^*\|, \tag{D.18}
\]
where ‖ · ‖ is the Euclidean distance.
Let F denote the set of all probability measures on $(w_t)_{t=1}^{T}$. Let $\bar{\mathcal{G}}=\mathcal{G}\cup\{1\}$. Let M denote
the set of all positive semi-definite k× k matrices. The following assumption defines the parameter
space H for the pair (γ, F ).
Assumption D.2. The parameter space H of the pairs (γ, F) is a subset of $\mathbb{R}^{d_\gamma}\times\mathcal{F}$ that satisfies:
(a) under every F such that (γ, F) ∈ H for some $\gamma\in\mathbb{R}^{d_\gamma}$, the markets are independent and ex
ante identical to each other, i.e. $\{\rho(w_t,\theta,g)\}_{t=1}^{T}$ is an i.i.d. sample for any θ, g;
(b) $\lim_{M\to\infty}\sup_{(\gamma,F)\in\mathcal{H}}E^*_F\left[\sup_{(\theta,g)\in\Gamma^{-1}(\gamma)\times\mathcal{G}}\|\rho(w_t,\theta,g)\|^2\,1\{\|\rho(w_t,\theta,g)\|^2>M\}\right]=0$;
(c) the class of functions $\{\rho(w_t,\theta,g):(\theta,g)\in\Gamma^{-1}(\gamma)\times\mathcal{G}\}$ is F-Donsker and pre-Gaussian
uniformly over H;
(d) the class of functions $\{\rho(w_t,\theta,g)\rho(w_t,\theta,g)':(\theta,g)\in\Gamma^{-1}(\gamma)\times\mathcal{G}\}$ is Glivenko-Cantelli
uniformly over H;
(e) ρF(θ, g) is differentiable with respect to θ ∈ Θ, and there exist constants C and δ1 > 0 such
that, for any $(\theta^{(1)},\theta^{(2)})$, $\sup_{(\gamma,F)\in\mathcal{H},\,g\in\mathcal{G}}\|\mathrm{vec}(G_F(\theta^{(1)},g))-\mathrm{vec}(G_F(\theta^{(2)},g))\|\le C\times\|\theta^{(1)}-\theta^{(2)}\|^{\delta_1}$;
and
(f) $\Sigma^{\iota}_F(\theta,g)\in\Psi$ for all (γ, F) ∈ H and $\theta\in\Gamma^{-1}(\gamma)$, where Ψ is a compact subset of M, and
$\{\mathrm{vec}(\Sigma_F(\cdot,g^{(1)},\cdot,g^{(2)})):(\Gamma^{-1}(\gamma))^2\to\mathbb{R}^{k^2}:(\gamma,F)\in\mathcal{H},\ g^{(1)},g^{(2)}\in\mathcal{G}\}$ are uniformly bounded and
uniformly equicontinuous.
Remark. Part (a) is the i.i.d. assumption, which can be replaced with appropriate weak dependence
conditions at the cost of a more complicated derivation of the uniform weak convergence of the
bootstrap empirical process. Part (b) is a standard uniform Lindeberg condition. Parts (c)-(d) impose
restrictions on the complexity of the set G as well as on the shape of ρ(wt, θ, g) as a function of
θ. A sufficient condition is that (i) ρ(wt, θ, g) is Lipschitz continuous in θ with the Lipschitz coefficient
being integrable and (ii) the set C in the definition of G forms a Vapnik-Cervonenkis set and Jt is
bounded. The Lipschitz continuity is also a sufficient condition for part (f).
The following assumption defines the null parameter space, H0, for the pair (γ, F).
Assumption D.3. The null parameter space H0 is a subset of H that satisfies:
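(a) for every $(\gamma, F) \in \mathcal{H}_0$, $\gamma \in \Gamma_{0,F}$, and
(b) there exist $C, c > 0$ and $2 \le \delta_2 < 2(\delta_1 + 1)$ such that $Q_F(\theta) \ge C \cdot (d(\theta,\Theta_{0,F}(\gamma))^{\delta_2} \wedge c)$ for all $(\gamma, F) \in \mathcal{H}_0$ and $\theta \in \Gamma^{-1}(\gamma)$.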
Remark. Part (b) is an identification strength assumption. It requires the criterion function to increase at a certain minimum rate as $\theta$ is perturbed away from the identified set. This assumption is weaker than the quadratic minorant assumption in Chernozhukov, Hong, and Tamer (2007) if $\delta_2 > 2$ and as strong as the latter if $\delta_2 = 2$. Putting part (b) and Assumption D.2(e) together, we can see that there is a trade-off between the minimum identification strength required and the degree of Hölder continuity of the first derivative of $\rho_F(\cdot, g)$. If $\rho_F(\cdot, g)$ is linear, $\delta_2$ can be arbitrarily large; that is, the criterion function can increase very slowly as $\theta$ is perturbed away from the identified set.
The following assumption is on the measure $\mu$. For any $\theta$, let a pseudo-metric on $\mathcal{G}$ be: $\|g^{(1)}-g^{(2)}\|_{\theta,F} = \|\rho_{F,j}(\theta, g^{(1)}) - \rho_{F,j}(\theta, g^{(2)})\|$. This assumption is needed for Lemma D.1 and not needed for the asymptotic size result Theorem D.1.
Assumption D.4. For any $\theta \in \Theta$, $\mu(\cdot)$ has full support on the metric space $(\mathcal{G}, \|\cdot\|_{\theta,F})$.
Remark. Assumption D.4 implies that for any θ ∈ Θ, F and j, if ρF,j(θ, g0) < 0 for some g0 ∈ G,
then there exists a neighborhood N (g0) with positive µ-measure such that ρF,j(θ, g) < 0 for all
g ∈ N (g0).
The following assumption is on the set GT .
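Assumption D.5. (a) $\mathcal{G}_T \uparrow \mathcal{G}$ as $T \to \infty$ and
(b) $\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\sqrt{T}\rho_F(\theta,g), \Sigma_F(\theta,g))\,d\mu(g) = 0$.
The following assumptions are imposed on the function $S$. For a $\xi > 0$, let the $\xi$-expansion of $\Psi$ be $\Psi^\xi = \{\Sigma \in \mathcal{M} : \inf_{\Sigma_1\in\Psi} \|vech(\Sigma) - vech(\Sigma_1)\| \le \xi\}$.
Assumption D.6. (a) $S(m,\Sigma) : (-\infty,\infty]^k \times \Psi^\xi \to R$ is continuous for some $\xi > 0$.
(b) There exist a constant $C > 0$ and $\xi > 0$ such that for any $m_1, m_2 \in R^k$ and $\Sigma_1, \Sigma_2 \in \Psi^\xi$, we have $|S(m_1,\Sigma_1) - S(m_2,\Sigma_2)| \le C\sqrt{(S(m_1,\Sigma_1) + S(m_2,\Sigma_2))(S(m_2,\Sigma_2) + 1)\Delta}$, where $\Delta = \|m_1 - m_2\|^2 + \|vech(\Sigma_1 - \Sigma_2)\|$.
(c) $S$ is non-increasing in $m$.
(d) $S(m,\Sigma) \ge 0$ and $S(m,\Sigma) = 0$ if and only if $m \in [0,\infty]^k$.
(e) $S$ is homogeneous in $m$ of degree 2.
(f) $S$ is convex in $m \in R^{d_m}$ for any $\Sigma \in \Psi^\xi$.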
Remark. We show in the lemma below that Assumption D.6 is satisfied by the example in (D.11)
(which is used in our empirical section) as well as the SUM and MAX functions in Andrews and
Shi (2013):
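\[ \text{SUM: } S(m,\Sigma) = \sum_{j=1}^k [m_j/\sigma_j]_-^2, \quad\text{and}\quad \text{MAX: } S(m,\Sigma) = \max_{1\le j\le k} [m_j/\sigma_j]_-^2, \tag{D.19} \]
where $\sigma_j^2$ is the $j$th diagonal element of $\Sigma$. Assumptions D.6(b) and (f) rule out the QLR function in Andrews and Shi (2013): $S(m,\Sigma) = \min_{t\ge 0}(m - t)'\Sigma^{-1}(m - t)$.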
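The SUM and MAX functions in (D.19) can be sketched in a few lines. The function names and the convention of passing only the diagonal of $\Sigma$ as a list `sigma2` are assumptions made for this illustration, not part of the paper.

```python
def neg_part_sq(x):
    """Return [x]_-^2, the squared negative part of x."""
    return min(x, 0.0) ** 2


def s_sum(m, sigma2):
    """SUM function: sum_j [m_j / sigma_j]_-^2, where sigma2 holds the diagonal of Sigma."""
    return sum(neg_part_sq(mj / s2 ** 0.5) for mj, s2 in zip(m, sigma2))


def s_max(m, sigma2):
    """MAX function: max_j [m_j / sigma_j]_-^2."""
    return max(neg_part_sq(mj / s2 ** 0.5) for mj, s2 in zip(m, sigma2))


# Hypothetical moment vector with two violated inequalities.
m = [-2.0, -3.0, 1.0]
sigma2 = [4.0, 1.0, 1.0]
print(s_sum(m, sigma2), s_max(m, sigma2))
```

Both functions vanish exactly when every component of `m` is non-negative and scale by $a^2$ when `m` is multiplied by $a > 0$, which corresponds to Assumptions D.6(d) and (e).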
Lemma D.2. (a) Assumption D.6 is satisfied by the $S$ function in (D.11) for any set $\Psi$.
(b) Assumption D.6 is satisfied by the SUM and the MAX functions in (D.19) if $\Psi$ is a compact subset of the set of positive semi-definite matrices with diagonal elements bounded below by some constant $\xi_2 > 0$.
The following assumptions are imposed on the tuning parameters in the subsampling and the bootstrap procedures.
Assumption D.7. (a) In the subsampling procedure, $b_T^{-1} + b_T T^{-1} \to 0$ and $S_T \to \infty$, and
(b) in the bootstrap procedure, $\kappa_T^{-1} + \kappa_T T^{-1} \to 0$ and $S_T \to \infty$.
D.5 Proof of Lemmas D.1 and D.2
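Proof of Lemma D.1. (a) Assumptions D.2(c)-(d) imply that under $F$,
\[ \Delta_{\rho,T} \equiv \sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} \|\rho_T(\theta,g) - \rho_F(\theta,g)\| \to_p 0, \quad\text{and}\quad \sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} \|vech(\Sigma_T(\theta,g) - \Sigma_F(\theta,g))\| \to_p 0. \tag{D.20} \]
The second convergence implies that
\[ \Delta_{\Sigma,T} \equiv \sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} \|vech(\Sigma_T^\iota(\theta,g) - \Sigma_F^\iota(\theta,g))\| \to_p 0. \tag{D.21} \]
By Assumption D.2(b), $\sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} \|\rho_F(\theta,g)\| < M^*$ for some $M^* < \infty$. Thus, $\{(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g)) : (\theta,g)\in\Gamma^{-1}(\gamma)\times\mathcal{G}\}$ is a subset of the compact set $[-M^*,M^*]^k \times \Psi$. By Assumption D.2(f) and Equations (D.20) and (D.21), we have $\{(\rho_T(\theta,g),\Sigma_T^\iota(\theta,g)) : (\theta,g)\in\Gamma^{-1}(\gamma)\times\mathcal{G}\} \subseteq [-M^*-\xi, M^*+\xi]^k \times \Psi^\xi$ with probability approaching one for any $\xi > 0$. By Assumption D.6(a), $S(m,\Sigma)$ is uniformly continuous on $[-M^*,M^*]^k \times \Psi$. Therefore, for any $\varepsilon > 0$,
\[ \begin{aligned} &\Pr\nolimits_F\left(\left|\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta) - \min_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}_T} S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g))\,d\mu(g)\right| > \varepsilon\right) \\ &\quad\le \Pr\nolimits_F\left(\sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} |S(\rho_T(\theta,g),\Sigma_T^\iota(\theta,g)) - S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g))| > \varepsilon\right) \to 0. \end{aligned} \tag{D.22} \]
Now it is left to show that $\min_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}_T} S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g))\,d\mu(g) \to \min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta)$ as $T\to\infty$. Observe that
\[ \begin{aligned} 0 &\le \min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta) - \min_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}_T} S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g))\,d\mu(g) \\ &\le \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g))\,d\mu(g) \\ &\le \int_{\mathcal{G}\setminus\mathcal{G}_T} \sup_{\theta\in\Gamma^{-1}(\gamma)} S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g))\,d\mu(g). \end{aligned} \tag{D.23} \]
We have $\sup_{\theta\in\Gamma^{-1}(\gamma)} S(\rho_F(\theta,g),\Sigma_F^\iota(\theta,g)) < \infty$, because $\rho_F(\theta,g)\in[-M^*,M^*]^k$, $\Sigma_F^\iota(\theta,g)\in\Psi$, and Assumption D.6(a) holds. Thus the last line of (D.23) converges to zero under Assumption D.5(a). This and (D.22) together show part (a).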
(b) The first half of part (b), minθ∈Γ−1(γ)QF (θ) ≥ 0, is implied by Assumption D.6(d).
Suppose γ ∈ Γ0,F . Then there exists a θ∗ ∈ Γ−1(γ) such that ρF (θ∗, g) ≥ 0 for all g ∈ G. This
implies that S(ρF (θ∗, g),ΣF (θ∗, g)) = 0 for all g ∈ G by Assumption D.6(d). Thus, QF (θ∗) = 0.
Because minθ∈Γ−1(γ)QF (θ) ≤ QF (θ∗) = 0, this shows the “if” part of the second half.
Suppose that minθ∈Γ−1(γ)QF (θ) = 0. By Assumptions D.1(a)-(b), Γ−1(γ) is compact. By
Assumptions D.2(e) and (f), QF (θ) is continuous in θ. Thus, there exists a θ∗ ∈ Γ−1(γ) such that
QF (θ∗) = minθ∈Γ−1(γ)QF (θ) = 0. We show by contradiction that this implies γ ∈ Γ0,F . Suppose
that γ /∈ Γ0,F . Then it must be that θ∗ /∈ Θ0,F , which implies that ρF,j(θ∗, g∗) < 0 for some g∗ ∈ G
and some j ≤ dm. Then by Assumption D.4, there exists a neighborhood N (g∗) with positive
µ-measure, such that ρF,j(θ∗, g) < 0 for all g ∈ N (g∗). This implies that QF (θ∗) > 0, which
contradicts QF (θ∗) = 0. Thus, the “only if” part is proved.
Proof of Lemma D.2. We prove part (b) only. Part (a) follows from the arguments for part (b)
because the S function in part (a) is the same as the SUM S function with Σ = I. Let ξ be any
positive number less than ξ2. Then the diagonal elements of all matrices in Ψξ are bounded below
by ξ2 − ξ.
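We prove the SUM part first. Assumptions D.6(a) and (c)-(f) are immediate, so it suffices to verify Assumption D.6(b). To do so, observe that
\[ \begin{aligned} |S(m_1,\Sigma_1) - S(m_2,\Sigma_2)| &= \left|\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)([m_{1,j}/\sigma_{1,j}]_- + [m_{2,j}/\sigma_{2,j}]_-)\right| \\ &\le \left(2\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2 \,(S(m_1,\Sigma_1) + S(m_2,\Sigma_2))\right)^{1/2} \\ &\equiv \left(2A\,(S(m_1,\Sigma_1) + S(m_2,\Sigma_2))\right)^{1/2}, \end{aligned} \tag{D.24} \]
where the inequality holds by the Cauchy-Schwarz inequality and the inequality $(a+b)^2 \le 2(a^2+b^2)$, and the $\equiv$ holds with $A := \sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2$. Now we manipulate $A$ in the following way:
\[ \begin{aligned} A &= \sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{1,j}]_- + [m_{2,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2 \\ &\le 2\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{1,j}]_-)^2 + 2\sum_{j=1}^k ([m_{2,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2 \\ &= 2\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{1,j}]_-)^2 + 2\sum_{j=1}^k (\sigma_{2,j} - \sigma_{1,j})^2 [m_{2,j}/\sigma_{2,j}]_-^2/\sigma_{1,j}^2 \\ &\le 2\|m_1 - m_2\|^2/(\xi_2 - \xi) + 2\|vech(\Sigma_1 - \Sigma_2)\|/(\xi_2 - \xi)\, S(m_2,\Sigma_2) \\ &\le 2(\xi_2 - \xi)^{-1}(S(m_2,\Sigma_2) + 1)(\|m_1 - m_2\|^2 + \|vech(\Sigma_1 - \Sigma_2)\|), \end{aligned} \tag{D.25} \]
where the first inequality holds by the inequality $(a+b)^2 \le 2(a^2+b^2)$ and the second inequality holds because $(\sigma_{2,j} - \sigma_{1,j})^2 \le |\sigma_{2,j}^2 - \sigma_{1,j}^2| \le \|vech(\Sigma_1 - \Sigma_2)\|$ and because $\sigma_{1,j}^2, \sigma_{2,j}^2 \ge \xi_2 - \xi$. Plugging (D.25) into (D.24), we obtain Assumption D.6(b).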
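The Cauchy-Schwarz step in (D.24) can be spot-checked numerically. The sketch below draws random moment vectors and standard deviations (the ranges, the dimension `k` and the helper names are arbitrary assumptions for the illustration) and verifies that $|S(m_1,\Sigma_1) - S(m_2,\Sigma_2)| \le (2A(S(m_1,\Sigma_1) + S(m_2,\Sigma_2)))^{1/2}$ for the SUM function. It is a sanity check, not a substitute for the proof.

```python
import random


def neg(x):
    """Negative part [x]_- = min(x, 0)."""
    return min(x, 0.0)


def check_d24(trials=1000, k=3, seed=0):
    """Return True if the bound in (D.24) holds on every random draw."""
    rng = random.Random(seed)
    for _ in range(trials):
        m1 = [rng.uniform(-5, 5) for _ in range(k)]
        m2 = [rng.uniform(-5, 5) for _ in range(k)]
        sd1 = [rng.uniform(0.5, 2.0) for _ in range(k)]
        sd2 = [rng.uniform(0.5, 2.0) for _ in range(k)]
        x = [neg(a / s) for a, s in zip(m1, sd1)]
        y = [neg(a / s) for a, s in zip(m2, sd2)]
        s1 = sum(v * v for v in x)              # S(m1, Sigma1) for SUM
        s2 = sum(v * v for v in y)              # S(m2, Sigma2) for SUM
        a_term = sum((u - v) ** 2 for u, v in zip(x, y))  # the quantity A
        if abs(s1 - s2) > (2 * a_term * (s1 + s2)) ** 0.5 + 1e-9:
            return False
    return True


print(check_d24())
```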
The proof for the MAX part is the same as that for the SUM part except for some minor changes. The first and obvious change is to replace every $\sum_{j=1}^k$ in the above arguments by $\max_{j=1,\ldots,k}$. The second change is to replace the Cauchy-Schwarz inequality used in (D.24) by the inequality $|\max_j a_j b_j| \le (\max_j a_j^2 \times \max_j b_j^2)^{1/2}$. The rest of the arguments stay unchanged.
E Proof of Theorem D.1
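We first introduce the approximation of $T_T(\gamma)$ that connects the distribution of $T_T(\gamma)$ with those of the subsampling statistic and the bootstrap statistic. For any $\theta \in \Theta_{0,F}(\gamma)$, let $\Lambda_T(\theta,\gamma) = \{\lambda : \theta + \lambda/\sqrt{T} \in \Gamma^{-1}(\gamma),\ d(\theta + \lambda/\sqrt{T}, \Theta_{0,F}(\gamma)) = \|\lambda\|/\sqrt{T}\}$. In words, $\Lambda_T(\theta,\gamma)$ is the set of all deviations from $\theta$ along the fastest paths away from $\Theta_{0,F}(\gamma)$. With this notation in hand, we can define the approximation of $T_T(\gamma)$ as follows:
\[ T_T^{appr}(\gamma) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_T(\theta,\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g) + G_F(\theta,g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g). \tag{E.1} \]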
Theorem E.1 shows that T apprT (γ) approximates TT (γ) asymptotically.
Theorem E.1. Suppose that Assumptions D.1-D.3 and D.5-D.6 hold. Then for any real sequence $\{x_T\}$ and scalar $\eta > 0$,
\[ \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\nolimits_F(T_T(\gamma) \le x_T + \eta) - \Pr(T_T^{appr}(\gamma) \le x_T)\right] \ge 0 \quad\text{and} \]
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\nolimits_F(T_T(\gamma) \le x_T) - \Pr(T_T^{appr}(\gamma) \le x_T + \eta)\right] \le 0. \]
Theorem E.1 is a key step in the proof of Theorem D.1 and is proved in the next subsection. The remaining proof of Theorem D.1 is given in the subsection after that.
E.1 Proof of Theorem E.1
The following lemma is used in the proof of Theorem E.1. It is a portmanteau theorem for uniform
weak approximation, which is an extension of the portmanteau theorem for (pointwise) weak con-
vergence in Chapter 1.3 of van der Vaart and Wellner (1996). Let (D, d) be a metric space and let
BL1 denote the set of all real functions on D with a Lipschitz norm bounded by one.
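Lemma E.1. (a) Let $(\Omega,\mathcal{B})$ be a measurable space. Let $X_T^{(1)} : \Omega \to D$ and $X_T^{(2)} : \Omega \to D$ be two sequences of mappings. Let $\mathcal{P}$ be a set of probability measures defined on $(\Omega,\mathcal{B})$. Suppose that $\sup_{P\in\mathcal{P}} \sup_{f\in BL_1} |E_P^* f(X_T^{(1)}) - E_{*,P} f(X_T^{(2)})| \to 0$. Then for any open set $G_0 \subseteq D$ and closed set $G_1 \subset G_0$, we have
\[ \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \left[\Pr\nolimits_{*,P}(X_T^{(1)} \in G_0) - \Pr\nolimits_P^*(X_T^{(2)} \in G_1)\right] \ge 0, \text{ and} \]
(b) Let $(\Omega,\mathcal{B})$ be a product space: $(\Omega,\mathcal{B}) = (\Omega_1 \times \Omega_2, \sigma(\mathcal{B}_1 \times \mathcal{B}_2))$. Let $\mathcal{P}_1$ be a set of probability measures defined on $(\Omega_1,\mathcal{B}_1)$ and $P_2$ be a probability measure on $(\Omega_2,\mathcal{B}_2)$. Suppose that $\sup_{P_1\in\mathcal{P}_1} \Pr_{P_1}^*(\sup_{f\in BL_1} |E_{P_2}^* f(X_T^{(1)}) - E_{*,P_2} f(X_T^{(2)})| > \varepsilon) \to 0$ for all $\varepsilon > 0$. Then for any open set $G_0 \subseteq D$ and closed set $G_1 \subset G_0$, we have for any $\varepsilon > 0$,
\[ \limsup_{T\to\infty} \sup_{P_1\in\mathcal{P}_1} \Pr\nolimits_{P_1}^*\left(\Pr\nolimits_{P_2}^*(X_T^{(1)} \in G_1) - \Pr\nolimits_{*,P_2}(X_T^{(2)} \in G_0) > \varepsilon\right) = 0. \]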
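Proof of Lemma E.1. (a) We first show that there is a Lipschitz continuous function sandwiched by $1(x \in G_0)$ and $1(x \in G_1)$. Let $f_a(x) = (a \cdot d(x, G_0^c)) \wedge 1$, where $G_0^c$ is the complement of $G_0$. Then $f_a$ is a Lipschitz function and $f_a(x) \le 1(x \in G_0)$ for any $a > 0$. Because $G_1$ is a closed subset of $G_0$, $\inf_{x\in G_1} d(x, G_0^c) > c$ for some $c > 0$. Let $a = c^{-1} + 1$. Then $f_a(x) \ge 1(x \in G_1)$. Thus, the function $f_a(x)$ is sandwiched between $1(x \in G_0)$ and $1(x \in G_1)$. Equivalently,
\[ a^{-1} 1(x \in G_1) \le a^{-1} f_a(x) \le a^{-1} 1(x \in G_0), \quad \forall x \in D. \tag{E.2} \]
By definition, $a^{-1} f_a(x) \in BL_1$. Using this fact and (E.2), we have
\[ \begin{aligned} &a^{-1} \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \left[\Pr\nolimits_{*,P}(X_T^{(1)} \in G_0) - \Pr\nolimits_P^*(X_T^{(2)} \in G_1)\right] \\ &\quad= \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \Big[a^{-1}\Pr\nolimits_{*,P}(X_T^{(1)} \in G_0) - E_{*,P}\, a^{-1} f_a(X_T^{(1)}) + E_{*,P}\, a^{-1} f_a(X_T^{(1)}) \\ &\qquad\qquad - E_P^*\, a^{-1} f_a(X_T^{(2)}) + E_P^*\, a^{-1} f_a(X_T^{(2)}) - a^{-1}\Pr\nolimits_P^*(X_T^{(2)} \in G_1)\Big] \\ &\quad\ge \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \left[E_{*,P}\, a^{-1} f_a(X_T^{(1)}) - E_P^*\, a^{-1} f_a(X_T^{(2)})\right] \\ &\quad= 0. \end{aligned} \tag{E.3} \]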
Therefore, part (a) is established.
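To make the sandwich construction concrete, the sketch below builds $f_a$ on the real line for the sets used later in the proof of Theorem E.1, $G_0 = (-\infty, \eta)$ and $G_1 = (-\infty, 0]$. The choice $c = \eta/2$ and the function name `make_sandwich` are assumptions for the illustration; any $c$ strictly below $\inf_{x\in G_1} d(x, G_0^c) = \eta$ would serve.

```python
def make_sandwich(eta):
    """Build f_a(x) = min(a * d(x, G0^c), 1) for G0 = (-inf, eta), G1 = (-inf, 0]."""
    c = eta / 2.0          # any c with 0 < c < eta works; eta / 2 is an arbitrary pick
    a = 1.0 / c + 1.0      # a = c^{-1} + 1 as in the proof

    def f_a(x):
        dist_to_complement = max(eta - x, 0.0)  # distance from x to [eta, inf)
        return min(a * dist_to_complement, 1.0)

    return a, f_a


a, f_a = make_sandwich(0.5)
for x in (-1.0, 0.0, 0.25, 0.4, 0.5, 1.0):
    print(x, f_a(x))
```

On $G_1$ the function equals one, outside $G_0$ it equals zero, and in between it interpolates linearly with slope $a$, so $a^{-1} f_a$ is bounded by one with Lipschitz constant one.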
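(b) Using the same $a$ and $f_a(x)$ as above, we have
\[ \begin{aligned} \Pr\nolimits_{P_2}^*(X_T^{(1)} \in G_1) - \Pr\nolimits_{*,P_2}(X_T^{(2)} \in G_0) &\le a\left[E_{P_2}^*\, a^{-1} f_a(X_T^{(1)}) - E_{*,P_2}\, a^{-1} f_a(X_T^{(2)})\right] \\ &\le a \sup_{f\in BL_1} |E_{*,P_2} f(X_T^{(1)}) - E_{P_2}^* f(X_T^{(2)})|. \end{aligned} \tag{E.4} \]
This implies part (b).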
Proof of Theorem E.1. We only need to show the first inequality because the second one follows from the same arguments with $T_T(\gamma)$ and $T_T^{appr}(\gamma)$ flipped.
The proof consists of four steps. In the first step, we show that the truncation of $\mathcal{G}$ has an asymptotically negligible effect: for all $\varepsilon > 0$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(|\bar{T}_T(\gamma) - T_T(\gamma)| > \varepsilon) = 0, \tag{E.5} \]
where $\bar{T}_T(\gamma)$ is the same as $T_T(\gamma)$ except that the integral is over $\mathcal{G}$ instead of $\mathcal{G}_T$.
In the second step, we define a bounded version of $\bar{T}_T(\gamma)$, $T_T(\gamma; B_1, B_2)$, and a bounded version of $T_T^{appr}(\gamma)$, $T_T^{appr}(\gamma; B_1, B_2)$, and show that for any $B_1, B_2 > 0$ and any real sequence $\{x_T\}$,
\[ \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\nolimits_F(T_T(\gamma; B_1, B_2) \le x_T + \eta) - \Pr(T_T^{appr}(\gamma; B_1, B_2) \le x_T)\right] \ge 0. \tag{E.6} \]
In the third step, we show that $T_T(\gamma; B_1, B_2)$ is asymptotically close in distribution to $\bar{T}_T(\gamma)$ for large enough $B_1, B_2$: for any $\varepsilon > 0$, there exist $B_{1,\varepsilon}$ and $B_{2,\varepsilon}$ such that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(T_T(\gamma; B_{1,\varepsilon}, B_{2,\varepsilon}) \ne \bar{T}_T(\gamma)) < \varepsilon. \tag{E.7} \]
In the fourth step, we show that $T_T^{appr}(\gamma; B_1, B_2)$ is asymptotically close in distribution to $T_T^{appr}(\gamma)$ for large enough $B_1, B_2$: for any $\varepsilon > 0$, there exist $B_{1,\varepsilon}$ and $B_{2,\varepsilon}$ such that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(T_T^{appr}(\gamma; B_{1,\varepsilon}, B_{2,\varepsilon}) \ne T_T^{appr}(\gamma)) < \varepsilon. \tag{E.8} \]
The four steps combined prove the theorem. Now we give detailed arguments for the four steps.
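STEP 1. First we show a property of the function $S$ that is useful throughout all steps: for any $(m_1,\Sigma_1)$ and $(m_2,\Sigma_2) \in R^k \times \Psi^\xi$,
\[ |S(m_1,\Sigma_1) - S(m_2,\Sigma_2)| \le C^2 \times (S(m_2,\Sigma_2) + 1)(\Delta + \sqrt{\Delta^2 + 8\Delta})/2, \tag{E.9} \]
for the $\Delta$ and $C$ in Assumption D.6(b). Let $\Delta_S := |S(m_1,\Sigma_1) - S(m_2,\Sigma_2)|$. Assumption D.6(b) implies that
\[ \begin{aligned} \Delta_S^2 &\le C^2 \times (S(m_1,\Sigma_1) + S(m_2,\Sigma_2))(S(m_2,\Sigma_2) + 1)\Delta \\ &\le C^2 \times (\Delta_S + 2S(m_2,\Sigma_2))(S(m_2,\Sigma_2) + 1)\Delta. \end{aligned} \tag{E.10} \]
Solving the quadratic inequality for $\Delta_S$, we have
\[ \begin{aligned} \Delta_S &\le \frac{C^2}{2}\left[(S(m_2,\Sigma_2) + 1)\Delta + \sqrt{(S(m_2,\Sigma_2) + 1)^2\Delta^2 + 8S(m_2,\Sigma_2)(S(m_2,\Sigma_2) + 1)\Delta}\right] \\ &\le \frac{C^2}{2}(S(m_2,\Sigma_2) + 1)(\Delta + \sqrt{\Delta^2 + 8\Delta}). \end{aligned} \tag{E.11} \]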
This shows (E.9).
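The algebra from (E.10) to (E.11) can be checked numerically. Writing $a = C^2(S(m_2,\Sigma_2)+1)\Delta$, the quadratic inequality $\Delta_S^2 \le a\Delta_S + 2aS(m_2,\Sigma_2)$ has largest admissible value $(a + \sqrt{a^2 + 8aS(m_2,\Sigma_2)})/2$. The sketch below compares this root with the two bounds in (E.11) on random inputs. It assumes $C \ge 1$, which is an assumption of this illustration (Assumption D.6(b) continues to hold if $C$ is enlarged); the function names are hypothetical.

```python
import math
import random


def exact_root(C, S2, delta):
    """Largest Delta_S satisfying Delta_S^2 <= a * Delta_S + 2 * a * S2, a = C^2 (S2 + 1) delta."""
    a = C ** 2 * (S2 + 1) * delta
    return (a + math.sqrt(a * a + 8 * a * S2)) / 2


def bound_first(C, S2, delta):
    """First line of (E.11)."""
    x = (S2 + 1) * delta
    return C ** 2 / 2 * (x + math.sqrt(x * x + 8 * S2 * x))


def bound_second(C, S2, delta):
    """Second line of (E.11), i.e. the right-hand side of (E.9)."""
    return C ** 2 / 2 * (S2 + 1) * (delta + math.sqrt(delta ** 2 + 8 * delta))


def check_chain(trials=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        C = rng.uniform(1.0, 5.0)       # assumption: C >= 1
        S2 = rng.uniform(0.0, 10.0)
        delta = rng.uniform(0.0, 10.0)
        r, b1, b2 = exact_root(C, S2, delta), bound_first(C, S2, delta), bound_second(C, S2, delta)
        if not (r <= b1 * (1 + 1e-12) + 1e-12 and b1 <= b2 * (1 + 1e-12) + 1e-12):
            return False
    return True


print(check_chain())
```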
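Now observe that
\[ \begin{aligned} 0 &\le \bar{T}_T(\gamma) - T_T(\gamma) \\ &\le \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\sqrt{T}\rho_T(\theta,g), \Sigma_T^\iota(\theta,g))\,d\mu(g) \\ &\le \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g) \\ &\quad+ \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} |S(\sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta,g)) - S(\sqrt{T}\rho_T(\theta,g), \Sigma_T^\iota(\theta,g))|\,d\mu(g) \\ &= o(1) + \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} |S(\sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta,g)) - S(\sqrt{T}\rho_T(\theta,g), \Sigma_T^\iota(\theta,g))|\,d\mu(g) \\ &\le o(1) + \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} C^2 \times (S(\sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta,g)) + 1)\,d\mu(g) \\ &\quad\times \sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}\setminus\mathcal{G}_T} c\left(\|\nu_T(\theta,g)\|^2 + \|vech(\Sigma_F^\iota(\theta,g) - \Sigma_T^\iota(\theta,g))\|\right) \\ &= o(1) + o(1) \times c(O_p(1)) \\ &= o_p(1), \end{aligned} \tag{E.12} \]
where $c(x) = (x + \sqrt{x^2 + 8x})/2$, the third inequality holds by the triangle inequality, the first equality holds by Assumption D.5(b), the fourth inequality holds by (E.9) and the second equality holds by Assumptions D.5(a)-(b) and D.2(c)-(d). The $o(1)$, $o_p(1)$ and $O_p(1)$ terms are uniform over $(\gamma, F) \in \mathcal{H}$. Thus, (E.5) is shown.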
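STEP 2. We define the bounded version of $\bar{T}_T(\gamma)$ as
\[ T_T(\gamma; B_1, B_2) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \int_{\mathcal{G}} S(\nu_T^{B_1}(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_T^\iota(\theta + \lambda/\sqrt{T}, g))\,d\mu(g), \tag{E.13} \]
where $\Lambda_T^{B_2}(\theta,\gamma) = \{\lambda \in \Lambda_T(\theta,\gamma) : T Q_F(\theta + \lambda/\sqrt{T}) \le B_2\}$, the process $\nu_T^{B_1}(\cdot,\cdot) = \max\{-B_1, \min\{B_1, \nu_T(\cdot,\cdot)\}\}$ and $\theta_T$ is a value lying on the line segment joining $\theta$ and $\theta + \lambda/\sqrt{T}$ satisfying the mean value expansion:
\[ \rho_F(\theta + \lambda/\sqrt{T}, g) = \rho_F(\theta, g) + G_F(\theta_T, g)\lambda/\sqrt{T}. \tag{E.14} \]
Define the bounded version of $T_T^{appr}(\gamma)$ as
\[ T_T^{appr}(\gamma; B_1, B_2) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \int_{\mathcal{G}} S(\nu_F^{B_1}(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta, g))\,d\mu(g), \tag{E.15} \]
where $\nu_F^{B_1}(\cdot,\cdot) = \max\{-B_1, \min\{B_1, \nu_F(\cdot,\cdot)\}\}$.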
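First we show a useful result: there exists some constant $C > 0$ such that for all $(\gamma, F) \in \mathcal{H}_0$ and $\lambda \in \Lambda_T^{B_2}(\theta,\gamma)$ and for the $\delta_2$ in Assumption D.3(b), we have
\[ \|\lambda\| \le C \times T^{(\delta_2 - 2)/(2\delta_2)}. \tag{E.16} \]
This is shown by observing that, for all $(\gamma, F) \in \mathcal{H}_0$ and $\lambda \in \Lambda_T^{B_2}(\theta,\gamma)$,
\[ B_2 \ge T Q_F(\theta + \lambda/\sqrt{T}) \ge C \cdot \left((T \times d(\theta + \lambda/\sqrt{T}, \Theta_{0,F}(\gamma))^{\delta_2}) \wedge (c \times T)\right). \tag{E.17} \]
The second inequality holds by Assumption D.3(b). Because $c \times T$ is eventually greater than $B_2$ as $T \to \infty$, we have for large enough $T$,
\[ B_2 \ge C \times T \times (\|\lambda\|/\sqrt{T})^{\delta_2}. \tag{E.18} \]
This implies (E.16).
Equation (E.16) implies two results:
\[ \begin{aligned} &(1)\ \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma)} \sup_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \|\lambda\|/\sqrt{T} \le O(T^{-1/\delta_2}) = o(1), \\ &(2)\ \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma)} \sup_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \sup_{g\in\mathcal{G}} \|G_F(\theta + O(\|\lambda\|)/\sqrt{T}, g)\lambda - G_F(\theta, g)\lambda\| \\ &\qquad\le O(1) \times \|\lambda\|^{\delta_1+1} T^{-\delta_1/2} \le O(T^{(\delta_2 - 2(\delta_1+1))/(2\delta_2)}) = o(1). \end{aligned} \tag{E.19} \]
The first result holds immediately given (E.16) and the second result holds by Assumption D.2(e).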
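The exponent bookkeeping behind (E.16) and (E.19) can be verified with exact rational arithmetic. In the sketch below, the function names and the particular values $\delta_1 = 1/2$, $\delta_2 = 5/2$ are assumptions chosen only so that $2 \le \delta_2 < 2(\delta_1 + 1)$ holds.

```python
from fractions import Fraction


def lambda_exponent(delta2):
    """Exponent of T in the bound on ||lambda|| from (E.18): 1/2 - 1/delta2."""
    delta2 = Fraction(delta2)
    return (delta2 - 2) / (2 * delta2)


def derivative_gap_exponent(delta1, delta2):
    """Exponent of T in ||lambda||^(delta1 + 1) * T^(-delta1 / 2), the second result in (E.19)."""
    delta1 = Fraction(delta1)
    return (delta1 + 1) * lambda_exponent(delta2) - delta1 / 2


d1, d2 = Fraction(1, 2), Fraction(5, 2)
print(lambda_exponent(d2))                      # exponent in (E.16)
print(lambda_exponent(d2) - Fraction(1, 2))     # exponent of ||lambda|| / sqrt(T)
print(derivative_gap_exponent(d1, d2))          # exponent in (E.19)(2)
```

The last exponent is negative exactly when $\delta_2 < 2(\delta_1 + 1)$, which is where the restriction in Assumption D.3(b) enters.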
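Define an intermediate statistic
\[ T_T^{med}(\gamma; B_1, B_2) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \int_{\mathcal{G}} S(\nu_T^{B_1}(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta, g))\,d\mu(g). \tag{E.20} \]
Then $T_T^{med}(\gamma; B_1, B_2)$ and $T_T^{appr}(\gamma; B_1, B_2)$ are respectively the following functional evaluated at $\nu_T(\cdot,\cdot)$ and $\nu_F(\cdot,\cdot)$:
\[ h(\nu) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \int_{\mathcal{G}} S(\nu^{B_1}(\theta,\cdot) + G_F(\theta,\cdot)\lambda + \sqrt{T}\rho_F(\theta,\cdot), \Sigma_F^\iota(\theta,\cdot))\,d\mu. \tag{E.21} \]
The functional $h(\nu)$ is uniformly bounded for all large enough $T$ because for any fixed $\theta \in \Theta_{0,F}(\gamma)$ and $\lambda \in \Lambda_T^{B_2}(\theta,\gamma)$,
\[ \begin{aligned} h(\nu) &\le 2\int_{\mathcal{G}} S(G_F(\theta,\cdot)\lambda + \sqrt{T}\rho_F(\theta,\cdot), \Sigma_F^\iota(\theta,\cdot))\,d\mu + 2\int_{\mathcal{G}} S(\nu^{B_1}(\theta,\cdot), \Sigma_F^\iota(\theta,\cdot))\,d\mu \\ &\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2\int_{\mathcal{G}} S(G_F(\theta,\cdot)\lambda + \sqrt{T}\rho_F(\theta,\cdot), \Sigma_F^\iota(\theta,\cdot))\,d\mu \\ &\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2T \times Q_F(\theta + \lambda/\sqrt{T}) \\ &\quad+ C^2 \times (T \times Q_F(\theta + \lambda/\sqrt{T}) + 1) \sup_{g\in\mathcal{G}} \left(\Delta_T(g) + \sqrt{\Delta_T(g)^2 + 8\Delta_T(g)}\right) \\ &\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2B_2 + C^2(B_2 + 1) \times o(1), \end{aligned} \tag{E.22} \]
where $\Delta_T(g) := \|G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g) - \sqrt{T}\rho_F(\theta_T, g)\|^2 + \|vech(\Sigma_F^\iota(\theta, g) - \Sigma_F^\iota(\theta_T, g))\|$ and $\theta_T = \theta + \lambda/\sqrt{T}$. The first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumptions D.2(f) and D.6(c), the third inequality holds by (E.9) and the last inequality holds by (E.19).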
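The functional $h(\nu)$ is Lipschitz continuous for all large enough $T$ with respect to the uniform metric because
\[ \begin{aligned} |h(\nu_1) - h(\nu_2)| &\le 2C \sup_{\theta\in\Theta_{0,F}(\gamma)} \sup_{\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \sup_{g\in\mathcal{G}} \|\nu_1(\theta,g) - \nu_2(\theta,g)\| \cdot (1 + h(\nu_1) + 2h(\nu_2)) \\ &\le \bar{C} \sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} \|\nu_1(\theta,g) - \nu_2(\theta,g)\|, \end{aligned} \tag{E.23} \]
where $\bar{C}$ is any constant such that $\bar{C} > 2C \times (6\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 6B_2 + 1)$, the first inequality holds by Assumption D.6(b) and the second holds by (E.22).
Therefore, for any $f \in BL_1$ and any real sequence $\{x_T\}$, the composite function $f(\bar{C}^{-1} h(\cdot) + x_T) \in BL_1$. By Assumption D.2(c), we have
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{f\in BL_1} |E_F f(T_T^{med}(\gamma; B_1, B_2) + x_T) - E f(T_T^{appr}(\gamma; B_1, B_2) + x_T)| = 0. \tag{E.24} \]
This combined with Lemma E.1(a) (with $G_0 = (-\infty, \eta)$ and $G_1 = (-\infty, 0]$) gives
\[ \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\nolimits_F(T_T^{med}(\gamma; B_1, B_2) \le x_T + \eta) - \Pr(T_T^{appr}(\gamma; B_1, B_2) \le x_T)\right] \ge 0. \tag{E.25} \]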
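Now it is left to show that $T_T^{med}(\gamma; B_1, B_2)$ and $T_T(\gamma; B_1, B_2)$ are close. First, we have
\[ \begin{aligned} &|T_T(\gamma; B_1, B_2) - T_T^{med}(\gamma; B_1, B_2)| \\ &\quad\le \sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \int_{\mathcal{G}} \Big|S(\nu_T^{B_1}(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_T^\iota(\theta + \lambda/\sqrt{T}, g)) \\ &\qquad\qquad - S(\nu_T^{B_1}(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta, g))\Big|\,d\mu(g) \\ &\quad\le C^2 \times \sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \max_{g\in\mathcal{G}} c(\Delta_T(\theta,\lambda,g)) \times \int_{\mathcal{G}} (1 + M_T(\theta,\lambda,g))\,d\mu(g), \end{aligned} \tag{E.26} \]
where $c(x) = (x + \sqrt{x^2 + 8x})/2$, $C$ is the constant in (E.9),
\[ \begin{aligned} \Delta_T(\theta,\lambda,g) &= \|\nu_T^{B_1}(\theta + \lambda/\sqrt{T}, g) - \nu_T^{B_1}(\theta, g) + G_F(\theta_T, g)\lambda - G_F(\theta, g)\lambda\|^2 \\ &\quad+ \|vech(\Sigma_T(\theta + \lambda/\sqrt{T}, g) - \Sigma_F(\theta, g))\| \quad\text{and} \\ M_T(\theta,\lambda,g) &= S(\nu_T^{B_1}(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta, g)). \end{aligned} \tag{E.27} \]
Below we show that for any $\varepsilon > 0$, and some universal constant $C > 0$,
\[ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(\sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma),\,g\in\mathcal{G}} \Delta_T(\theta,\lambda,g) > \varepsilon\right) \to 0 \quad\text{and} \tag{E.28} \]
\[ \sup_T \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma)} \int_{\mathcal{G}} M_T(\theta,\lambda,g)\,d\mu(g) < C. \tag{E.29} \]
Once (E.28) and (E.29) are shown, it is immediate that for any $\varepsilon > 0$,
\[ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(|T_T(\gamma; B_1, B_2) - T_T^{med}(\gamma; B_1, B_2)| > \varepsilon\right) \to 0. \tag{E.30} \]
This combined with (E.25) shows (E.6).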
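Now we show (E.28) and (E.29). The convergence result (E.28) is implied by the following results: for any $\varepsilon > 0$,
\[ \begin{aligned} &\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(\sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma),\,g\in\mathcal{G}} \|\nu_T^{B_1}(\theta + \lambda/\sqrt{T}, g) - \nu_T^{B_1}(\theta, g)\| > \varepsilon\right) \to 0, \\ &\sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma),\,g\in\mathcal{G}} \|G_F(\theta_T, g)\lambda - G_F(\theta, g)\lambda\| \to 0, \quad\text{and} \\ &\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(\sup_{\theta\in\Theta_{0,F}(\gamma),\,\lambda\in\Lambda_T^{B_2}(\theta,\gamma),\,g\in\mathcal{G}} \|vech(\Sigma_T(\theta + \lambda/\sqrt{T}, g) - \Sigma_F(\theta, g))\| > \varepsilon\right) \to 0. \end{aligned} \tag{E.31} \]
The first result in the above display holds by the first result in equation (E.19) and the uniform stochastic equicontinuity of the empirical process $\{\nu_T(\cdot, g) : \Gamma^{-1}(\gamma) \to R^{d_m}\}$ with respect to the Euclidean metric. The uniform equicontinuity is implied by Assumptions D.2(b), (c) and (f) by Theorem 2.8.2 of van der Vaart and Wellner (1996). The second result in the above display holds by the second result in (E.19). The third result in (E.31) holds by Assumptions D.2(d) and (f).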
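Result (E.29) holds because for any $\theta \in \Theta_{0,F}(\gamma)$ and $\lambda \in \Lambda_T^{B_2}(\theta,\gamma)$,
\[ \begin{aligned} \int_{\mathcal{G}} M_T(\theta,\lambda,g)\,d\mu(g) &\le 2\int_{\mathcal{G}} S(\nu_T^{B_1}(\theta, g), \Sigma_F^\iota(\theta, g))\,d\mu(g) + 2\int_{\mathcal{G}} S(G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta, g))\,d\mu(g) \\ &\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2\int_{\mathcal{G}} S(G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta,g), \Sigma_F^\iota(\theta, g))\,d\mu(g) \\ &\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2B_2 + C^2(B_2 + 1) \times o(1), \end{aligned} \tag{E.32} \]
where the first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumption D.6(c) and the last inequality holds by the second and third inequalities in (E.22), and the $o(1)$ is uniform over $(\theta, \lambda)$.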
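STEP 3. In order to show (E.7), first extend the definition of $T_T(\gamma; B_1, B_2)$ from Step 2 to allow $B_1$ and $B_2$ to take the value $\infty$ and observe that $T_T(\gamma; \infty, \infty) = \bar{T}_T(\gamma)$.
Assumption D.2(c) and Lemma E.1 imply that for any $\varepsilon > 0$, there exists $B_{1,\varepsilon}$ large enough such that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(\sup_{\theta\in\Theta,\,g\in\mathcal{G}} \|\nu_T(\theta,g)\| > B_{1,\varepsilon}\right) < \varepsilon. \tag{E.33} \]
Therefore we have for all $B_2$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(T_T(\gamma; \infty, B_2) \ne T_T(\gamma; B_{1,\varepsilon}, B_2)\right) < \varepsilon. \tag{E.34} \]
To show that $\bar{T}_T(\gamma)$ and $T_T(\gamma; \infty, B_2)$ are close for $B_2$ large enough, first observe that
\[ \begin{aligned} \bar{T}_T(\gamma) &\le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_T(\theta,g) + \sqrt{T}\rho_F(\theta,g), \Sigma_T^\iota(\theta,g))\,d\mu(g) \\ &\le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_T(\theta,g), \Sigma_T^\iota(\theta,g))\,d\mu(g) \\ &= O_p(1), \end{aligned} \tag{E.35} \]
where the first inequality holds because $0 \in \Lambda_T(\theta,\gamma)$, the second inequality holds because $\rho_F(\theta,g) \ge 0$ for $\theta \in \Theta_{0,F}(\gamma)$ and by Assumption D.6(c), and the equality holds by Assumptions D.6(a)-(c) and Assumptions D.2(c), (d) and (f). The $O_p(1)$ is uniform over $(\gamma, F) \in \mathcal{H}_0$.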
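For any $T$, $\gamma$, $B_2$, if $\bar{T}_T(\gamma) \ne T_T(\gamma; \infty, B_2)$, then there must be a $\theta^* \in \Gamma^{-1}(\gamma)$ such that
\[ T \times Q_F(\theta^*) > B_2 \quad\text{and}\quad \int_{\mathcal{G}} S(\nu_T(\theta^*, g) + \sqrt{T}\rho_F(\theta^*, g), \Sigma_T^\iota(\theta^*, g))\,d\mu(g) < O_p(1). \tag{E.36} \]
But
\[ \begin{aligned} &\int_{\mathcal{G}} S(\nu_T(\theta^*, g) + \sqrt{T}\rho_F(\theta^*, g), \Sigma_T^\iota(\theta^*, g))\,d\mu(g) \\ &\quad\ge 2^{-1}\int_{\mathcal{G}} S(\sqrt{T}\rho_F(\theta^*, g), \Sigma_T^\iota(\theta^*, g))\,d\mu(g) - \int_{\mathcal{G}} S(-\nu_T(\theta^*, g), \Sigma_T^\iota(\theta^*, g))\,d\mu(g) \\ &\quad\ge 2^{-1}\int_{\mathcal{G}} S(\sqrt{T}\rho_F(\theta^*, g), \Sigma_T^\iota(\theta^*, g))\,d\mu(g) - O_p(1) \\ &\quad\ge 2^{-1}\left[T Q_F(\theta^*) - \int_{\mathcal{G}} |S(\sqrt{T}\rho_F(\theta^*, \cdot), \Sigma_T^\iota(\theta^*, \cdot)) - S(\sqrt{T}\rho_F(\theta^*, \cdot), \Sigma_F^\iota(\theta^*, \cdot))|\,d\mu\right] - O_p(1) \\ &\quad\ge 2^{-1}\left[T Q_F(\theta^*) - C^2 \sup_{g\in\mathcal{G}} c(\|vech(\Sigma_T^\iota(\theta^*, g) - \Sigma_F^\iota(\theta^*, g))\|) \times (1 + T Q_F(\theta^*))\right] - O_p(1) \\ &\quad= B_2/2 - o(1) - o_p(1) \times C^2 \times B_2/4 - O_p(1), \end{aligned} \tag{E.37} \]
where $c(x) = (x + \sqrt{x^2 + 8x})/2$ and $C$ is the constant in (E.9). The first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumption D.6(c) and Assumptions D.2(c)-(d) and (f), the third inequality holds by the triangle inequality, the fourth inequality holds by (E.9) and the equality holds by Assumption D.2(d). The $o(1)$, $o_p(1)$ and $O_p(1)$ terms are uniform over $\theta^* \in \Gamma^{-1}(\gamma)$ and $(\gamma, F) \in \mathcal{H}_0$.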
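Then
\[ \begin{aligned} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(\bar{T}_T(\gamma) \ne T_T(\gamma; \infty, B_2)\right) &\le \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(2^{-1}(1 - o_p(1)) \times B_2 - o(1) - O_p(1) \le O_p(1)\right) \\ &= \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(O_p(1) \ge B_2), \end{aligned} \tag{E.38} \]
where the first inequality holds by (E.36) and (E.37). Then for any $\varepsilon$, there exists $B_{2,\varepsilon}$ such that
\[ \lim_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(\bar{T}_T(\gamma) \ne T_T(\gamma; \infty, B_{2,\varepsilon})) < \varepsilon. \tag{E.39} \]
Combining this with (E.34), we have (E.7).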
STEP 4. In order to show (E.8), first extend the definition of $T_T^{appr}(\gamma; B_1, B_2)$ from Step 2 to allow $B_1$ and $B_2$ to take the value $\infty$ and observe that $T_T^{appr}(\gamma; \infty, \infty) = T_T^{appr}(\gamma)$.
By the same arguments as those for (E.34), for any $\varepsilon$ and $B_2$, there exists $B_{1,\varepsilon}$ large enough so that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(T_T^{appr}(\gamma; \infty, B_2) \ne T_T^{appr}(\gamma; B_{1,\varepsilon}, B_2)\right) < \varepsilon. \tag{E.40} \]
Also by the same reasons as those for (E.35), we have
\[ T_T^{appr}(\gamma) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g), \tag{E.41} \]
where the right hand side is a real-valued random variable.
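For any $T$ and $B_2$, if $T_T^{appr}(\gamma) \ne T_T^{appr}(\gamma; \infty, B_2)$, then there must be a $\theta^* \in \Theta_{0,F}(\gamma)$ and a $\lambda^{**} \in \{\lambda \in \Lambda_T(\theta^*, \gamma) : T \times Q_F(\theta^* + \lambda/\sqrt{T}) > B_2\}$ such that
\[ I(\lambda^{**}) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g), \tag{E.42} \]
where $I(\lambda) = \int_{\mathcal{G}} S(\nu_F(\theta^*, g) + G_F(\theta^*, g)\lambda + \sqrt{T}\rho_F(\theta^*, g), \Sigma_F^\iota(\theta^*, g))\,d\mu(g)$. Next we show that if $\lambda^{**}$ exists, then there must exist a $\lambda^*$ such that
\[ \lambda^* \in \{\lambda \in \Lambda_T(\theta^*, \gamma) : T \times Q_F(\theta^* + \lambda/\sqrt{T}) \in (B_2, 2B_2]\} \quad\text{and}\quad I(\lambda^*) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g). \tag{E.43} \]
If $T \times Q_F(\theta^* + \lambda^{**}/\sqrt{T}) \in (B_2, 2B_2]$, then we are done. If $T \times Q_F(\theta^* + \lambda^{**}/\sqrt{T}) > 2B_2$, there must be an $a^* \in (0, 1)$ such that $T \times Q_F(\theta^* + a^*\lambda^{**}/\sqrt{T}) \in (B_2, 2B_2]$ because $T Q_F(\theta^* + 0 \times \lambda^{**}/\sqrt{T}) = 0$ and $T Q_F(\theta^* + a\lambda^{**}/\sqrt{T})$ is continuous in $a$ (by Assumptions D.2(e) and D.6(a)). By Assumption D.6(f), $I(\lambda)$ is convex. Thus $I(a^*\lambda^{**}) \le a^* I(\lambda^{**}) + (1 - a^*) I(0)$. By the same arguments as those for (E.35), $I(0) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g)$. Thus, $I(a^*\lambda^{**}) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g)$. Assumption D.1(c) and the definition of $\Lambda_T(\theta,\gamma)$ guarantee that $a^*\lambda^{**} \in \Lambda_T(\theta^*, \gamma)$. Therefore, $\lambda^* = a^*\lambda^{**}$ satisfies (E.43).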
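Similar to (E.19) we have
\[ \begin{aligned} &(1)\ \|\lambda^*\|/\sqrt{T} \le B_2 \times 2C \times T^{-1/\delta_2} = B_2 \times o(1), \\ &(2)\ \sup_{g\in\mathcal{G}} \|G_F(\theta^* + O(\|\lambda^*\|)/\sqrt{T}, g)\lambda^* - G_F(\theta^*, g)\lambda^*\| \le O(1) \times B_2^{(\delta_1+1)/\delta_2} \|\lambda\|^{\delta_1+1} T^{-\delta_1/2} = B_2^{(\delta_1+1)/\delta_2}\, o(1), \end{aligned} \tag{E.44} \]
where the $o(1)$ terms do not depend on $B_2$. Then,
\[ \begin{aligned} I(\lambda^*) &\ge 2^{-1}\int_{\mathcal{G}} S(G_F(\theta^*, g)\lambda^* + \sqrt{T}\rho_F(\theta^*, g), \Sigma_F^\iota(\theta^*, g))\,d\mu(g) - \int_{\mathcal{G}} S(-\nu_F(\theta^*, g), \Sigma_F^\iota(\theta^*, g))\,d\mu(g) \\ &\ge T Q_F(\theta^* + \lambda^*/\sqrt{T})/2 - C^2 \times (T Q_F(\theta^* + \lambda^*/\sqrt{T}) + 1) \times c(\Delta_T)/2 + O_p(1) \\ &= T Q_F(\theta^* + \lambda^*/\sqrt{T})/2 - C^2 \times (2B_2 + 1) \times c(\Delta_T)/4 + O_p(1), \end{aligned} \tag{E.45} \]
where the $O_p(1)$ term is uniform over $(\gamma, F) \in \mathcal{H}_0$, $c(x) = (x + \sqrt{x^2 + 8x})/2$ and
\[ \Delta_T := \|G_F(\theta^*, g)\lambda^* + \sqrt{T}\rho_F(\theta^*, g) - \sqrt{T}\rho_F(\theta^* + \lambda^*/\sqrt{T}, g)\|^2 + \|vech(\Sigma_F^\iota(\theta^* + \lambda^*/\sqrt{T}, g) - \Sigma_F^\iota(\theta^*, g))\|. \tag{E.46} \]
The first inequality in (E.45) holds by Assumptions D.6(e)-(f), the second inequality holds by (E.9) and the equality holds by (E.43). By (E.44) and Assumption D.2(f), for any fixed $B_2$, $\lim_{T\to\infty} \Delta_T = 0$. Therefore, for each fixed $B_2$,
\[ I(\lambda^*) \ge T Q_F(\theta^* + \lambda^*/\sqrt{T})/2 - O_p(1) \ge B_2/2 - O_p(1). \tag{E.47} \]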
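Thus
\[ \begin{aligned} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(T_T^{appr}(\gamma) \ne T_T^{appr}(\gamma; \infty, B_2)) &\le \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\left(\sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta,g), \Sigma_F^\iota(\theta,g))\,d\mu(g) \ge B_2/2 - O_p(1)\right) \\ &= \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(O_p(1) \ge B_2). \end{aligned} \tag{E.48} \]
For any $\varepsilon > 0$, there exists $B_{2,\varepsilon}$ large enough so that $\lim_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(O_p(1) \ge B_{2,\varepsilon}) < \varepsilon$. Thus,
\[ \lim_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(T_T^{appr}(\gamma) \ne T_T^{appr}(\gamma; \infty, B_{2,\varepsilon})) < \varepsilon. \tag{E.49} \]
Combining this with (E.40), we have (E.8).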
E.2 Proof of Theorem D.1
The following lemma is used in the proof of Theorem D.1. It shows the convergence of the bootstrap empirical process $\nu_T^*(\theta, g)$. Let $W_{T,t}$ be the number of times that the $t$th observation appears in a bootstrap sample. Then $(W_{T,1}, \ldots, W_{T,T})$ is a random draw from a multinomial distribution with parameters $T$ and $(T^{-1}, \ldots, T^{-1})$, and $\nu_T^*(\theta, g)$ can be written as
\[ \nu_T^*(\theta, g) = T^{-1/2}\sum_{t=1}^T (W_{T,t} - 1)\rho(w_t, \theta, g). \tag{E.50} \]
In the lemma, the subscripts $F$ and $W$ for $E$ and $\Pr$ signify the fact that the expectation and the probabilities are taken with respect to the randomness in the data and the randomness in $\{W_{T,t}\}$ respectively.
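The representation in (E.50) can be illustrated with a short simulation. The sketch below draws the multinomial weights by resampling indices with replacement and evaluates the weighted sum for a scalar moment function at a fixed $(\theta, g)$; the list `rho_values` is a hypothetical stand-in for $\rho(w_t, \theta, g)$, $t = 1, \ldots, T$.

```python
import random


def multinomial_weights(T, rng):
    """Draw (W_1, ..., W_T) ~ Multinomial(T; 1/T, ..., 1/T) by resampling indices."""
    weights = [0] * T
    for _ in range(T):
        weights[rng.randrange(T)] += 1
    return weights


def bootstrap_process(rho_values, weights):
    """nu*_T = T^{-1/2} * sum_t (W_t - 1) * rho_t for scalar rho_t."""
    T = len(rho_values)
    return sum((w - 1) * r for w, r in zip(weights, rho_values)) / T ** 0.5


rng = random.Random(0)
rho_values = [rng.gauss(0.0, 1.0) for _ in range(200)]  # hypothetical moment values
weights = multinomial_weights(len(rho_values), rng)
print(sum(weights), bootstrap_process(rho_values, weights))
```

Because the weights sum to $T$, the centering by one removes any constant component of $\rho$, so the process is unaffected by adding a constant to every $\rho(w_t, \theta, g)$.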
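Lemma E.2. Suppose that Assumption D.2 holds. Then for any $\varepsilon > 0$,
(a) $\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr_F^*\left(\sup_{f\in BL_1} |E_W f(\nu_T^*(\cdot,\cdot)) - E f(\nu_F(\cdot,\cdot))| > \varepsilon\right) = 0$,
(b) there exists $B_\varepsilon$ large enough such that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr\nolimits_F^*\left(\Pr\nolimits_W\left(\sup_{\theta\in\Gamma^{-1}(\gamma),\,g\in\mathcal{G}} \|\nu_T^*(\theta,g)\| > B_\varepsilon\right) > \varepsilon\right) = 0, \text{ and} \]
(c) there exists $\delta_\varepsilon$ small enough such that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr\nolimits_F^*\left(\Pr\nolimits_W\left(\sup_{g\in\mathcal{G}} \sup_{\|\theta^{(1)}-\theta^{(2)}\|\le\delta_\varepsilon} \|\nu_T^*(\theta^{(1)},g) - \nu_T^*(\theta^{(2)},g)\| > \varepsilon\right) > \varepsilon\right) = 0. \]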
Proof of Lemma E.2. (a) Part (a) is proved using a combination of the arguments in Theorem 2.9.6 and Theorem 3.6.1 in van der Vaart and Wellner (1996). Take a Poisson number $N_T$ with mean $T$ and independent of the original sample. Then $W_{N_T,1}, \ldots, W_{N_T,T}$ are i.i.d. Poisson variables with mean one. Let the Poissonized version of $\nu_T^*(\theta, g)$ be
\[ \nu_T^{poi}(\theta, g) = T^{-1/2}\sum_{t=1}^T (W_{N_T,t} - 1)\rho(w_t, \theta, g). \tag{E.51} \]
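A minimal sketch of the Poissonized weights in (E.51) follows. Since the Python standard library has no Poisson sampler, the example uses Knuth's multiplication method; that choice, the function names and the scalar `rho_values` are assumptions for the illustration only.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def poissonized_process(rho_values, rng):
    """nu^poi_T = T^{-1/2} * sum_t (W_t - 1) * rho_t with W_t i.i.d. Poisson(1)."""
    T = len(rho_values)
    weights = [poisson_draw(1.0, rng) for _ in range(T)]
    return sum((w - 1) * r for w, r in zip(weights, rho_values)) / T ** 0.5, weights


rng = random.Random(0)
rho_values = [rng.gauss(0.0, 1.0) for _ in range(500)]  # hypothetical moment values
value, weights = poissonized_process(rho_values, rng)
print(value, sum(weights))
```

Unlike the multinomial weights in (E.50), the Poisson weights are independent across observations and their sum is random rather than fixed at $T$, which is what makes the multiplier central limit theorem applicable.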
Theorem 2.9.6 in van der Vaart and Wellner (1996) is a multiplier central limit theorem that shows that if $\{\rho(w_t, \theta, g) : (\theta, g) \in \Theta \times \mathcal{G}\}$ is $F$-Donsker and pre-Gaussian, then $\nu_T^{poi}(\theta, g)$ converges weakly to $\nu_F(\theta, g)$ conditional on the data in outer probability. The arguments of Theorem 2.9.6 remain valid if we strengthen the $F$-Donsker and pre-Gaussian condition to the uniform Donsker and pre-Gaussian condition of Assumption D.2(c) and strengthen the conclusion to uniform weak convergence:
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr\nolimits_F^*\left(\sup_{f\in BL_1} |E_W f(\nu_T^{poi}(\cdot,\cdot)) - E f(\nu_F(\cdot,\cdot))| > \varepsilon\right) = 0. \tag{E.52} \]
In particular, the extension to the uniform versions of the first and the third displays in the proof of Theorem 2.9.6 in van der Vaart and Wellner (1996) is straightforward. To extend the second display, we only need to replace Lemma 2.9.5 with Proposition A.5.2, a uniform central limit theorem for finite dimensional vectors.
Theorem 3.6.1 in van der Vaart and Wellner (1996) shows that, under a fixed $(\gamma, F)$, the bounded Lipschitz distance between $\nu_T^{poi}(\theta, g)$ and $\nu_T^*(\theta, g)$ converges to zero conditional on (outer) almost all realizations of the data. The arguments remain valid if we strengthen the Glivenko-Cantelli assumption used there to uniform Glivenko-Cantelli (which is implied by Assumption D.2(c)) and strengthen the conclusion to: for all $\varepsilon > 0$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr\nolimits_F^*\left(\sup_{f\in BL_1} |E_W f(\nu_T^{poi}(\cdot,\cdot)) - E_W f(\nu_T^*(\cdot,\cdot))| > \varepsilon\right) = 0. \tag{E.53} \]
Equations (E.52) and (E.53) together imply part (a).
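(b) Part (b) is implied by part (a), Lemma E.1(b) and the uniform pre-Gaussianity assumption (Assumption D.2(c)). When applying Lemma E.1(b), consider $X_T^{(1)} = \nu_T^*$, $X_T^{(2)} = \nu_F$, $G_1 = \{\nu : \sup_{\theta,g} \|\nu(\theta,g)\| \ge B_\varepsilon\}$, and $G_0 = \{\nu : \sup_{\theta,g} \|\nu(\theta,g)\| > B_\varepsilon - 1\}$, where $B_\varepsilon$ satisfies:
\[ \sup_{(\gamma,F)\in\mathcal{H}} \Pr\left(\sup_{\theta\in\Theta,\,g\in\mathcal{G}} \|\nu_F(\theta,g)\| > B_\varepsilon - 1\right) < \varepsilon/2. \tag{E.54} \]
Such a $B_\varepsilon$ exists because $\{\rho(w_t, \theta, g) : (\theta, g) \in \Theta \times \mathcal{G}\}$ is uniformly pre-Gaussian by Assumption D.2(c).
(c) Part (c) is implied by part (a), Lemma E.1(b) and the uniform pre-Gaussianity assumption (Assumption D.2(c)). When applying Lemma E.1(b), consider $X_T^{(1)} = \nu_T^*$, $X_T^{(2)} = \nu_F$, $G_1 = \{\nu : \sup_{\|\theta^{(1)}-\theta^{(2)}\|\le\Delta_\varepsilon,\,g} \|\nu(\theta^{(1)},g) - \nu(\theta^{(2)},g)\| \ge \varepsilon\}$, and $G_0 = \{\nu : \sup_{\|\theta^{(1)}-\theta^{(2)}\|\le\Delta_\varepsilon,\,g} \|\nu(\theta^{(1)},g) - \nu(\theta^{(2)},g)\| > \varepsilon/2\}$, where $\Delta_\varepsilon$ satisfies:
\[ \sup_{(\gamma,F)\in\mathcal{H}} \Pr\left(\sup_{\|\theta^{(1)}-\theta^{(2)}\|\le\Delta_\varepsilon,\,g} \|\nu_F(\theta^{(1)},g) - \nu_F(\theta^{(2)},g)\| > \varepsilon/2\right) < \varepsilon/2. \tag{E.55} \]
Such a $\Delta_\varepsilon$ exists because $\{\rho(w_t, \theta, g) : (\theta, g) \in \Theta \times \mathcal{G}\}$ is uniformly pre-Gaussian.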
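Proof of Theorem D.1. (a) Let $q_{b_T}^{appr}(\gamma, p)$ denote the $p$ quantile of $T_{b_T}^{appr}(\gamma)$. Let $\eta_2 = \eta^*/3$. Below we show that
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}(c_T^{sub}(\gamma, p) \le q_{b_T}^{appr}(\gamma, p) + \eta_2) = 0, \tag{E.56} \]
where $\Pr_{F,sub}$ signifies the fact that there are two sources of randomness in $c_T^{sub}(\gamma, p)$, one from the original sampling and the other from the subsampling. Once (E.56) is established, we have
\[ \begin{aligned} &\liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}\left(T_T(\gamma) \le c_T^{sub}(\gamma, p)\right) \\ &\quad\ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(T_T(\gamma) \le q_{b_T}^{appr}(\gamma, p) + \eta_2\right) \\ &\quad\ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\nolimits_F\left(T_T(\gamma) \le q_{b_T}^{appr}(\gamma, p) + \eta_2\right) - \Pr\left(T_T^{appr}(\gamma) \le q_{b_T}^{appr}(\gamma, p)\right)\right] \\ &\qquad+ \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\left(T_T^{appr}(\gamma) \le q_{b_T}^{appr}(\gamma, p)\right) - \Pr\left(T_{b_T}^{appr}(\gamma) \le q_{b_T}^{appr}(\gamma, p)\right)\right] \\ &\qquad+ \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr\left(T_{b_T}^{appr}(\gamma) \le q_{b_T}^{appr}(\gamma, p)\right) \\ &\quad\ge p, \end{aligned} \tag{E.57} \]
where the first inequality holds by (E.56). The third inequality holds because the first two lim infs after the second inequality are greater than or equal to zero and the third is greater than or equal to $p$. The first lim inf is greater than or equal to zero by Theorem E.1. The second lim inf is greater than or equal to zero because $T_{b_T}^{appr}(\gamma) \ge T_T^{appr}(\gamma)$ for any $\gamma$ and $T$, which holds because $\sqrt{T} \ge \sqrt{b_T}$ and $\Lambda_{b_T}(\theta,\gamma) \subseteq \Lambda_T(\theta,\gamma)$ for large enough $T$ by Assumptions D.1(c) and D.7(c).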
Now it is left to show (E.56). In order to show (E.56), we first show that the c.d.f. of $T_{b_T}^{appr}(\gamma)$ is close to the following empirical distribution function:
\[ \hat{L}_{T,b_T}(x; \gamma) = S_T^{-1}\sum_{s=1}^{S_T} 1\left(T_{T,b_T}^s(\gamma) \le x\right). \tag{E.58} \]
Define an intermediate quantity first:
\[ \tilde{L}_{T,b_T}(x; \gamma) = q_T^{-1}\sum_{l=1}^{q_T} 1\left(T_{T,b_T}^l(\gamma) \le x\right), \tag{E.59} \]
where $q_T = \binom{T}{b_T}$ and $(T_{T,b_T}^l(\gamma))_{l=1}^{q_T}$ are the subsample statistics computed using all $q_T$ possible subsamples of size $b_T$ of the original sample. Conditional on the original sample, $(T_{T,b_T}^s(\gamma))_{s=1}^{S_T}$ is a set of $S_T$ i.i.d. draws from $\tilde{L}_{T,b_T}(\cdot; \gamma)$. By the uniform Glivenko-Cantelli theorem, for any $\varepsilon > 0$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}\left(\sup_{x\in R} \left|\hat{L}_{T,b_T}(x; \gamma) - \tilde{L}_{T,b_T}(x; \gamma)\right| > \varepsilon\right) = 0. \tag{E.60} \]
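The empirical distribution in (E.58) is straightforward to compute once the subsample statistics are available. The sketch below draws $S_T$ random subsamples of size $b_T$ without replacement and evaluates the empirical c.d.f. The statistic `toy_stat` is a hypothetical placeholder, not the paper's $T_{T,b_T}^s(\gamma)$, which requires the full minimization over $\theta$ and integration over $g$.

```python
import random


def subsample_stats(data, b, S, stat, rng):
    """Compute stat on S random subsamples of size b drawn without replacement."""
    return [stat(rng.sample(data, b)) for _ in range(S)]


def empirical_cdf(values, x):
    """Fraction of values that are <= x, as in (E.58)."""
    return sum(1 for v in values if v <= x) / len(values)


def toy_stat(sub):
    """Hypothetical placeholder statistic: b * [sample mean]_-^2."""
    b = len(sub)
    mean = sum(sub) / b
    return b * min(mean, 0.0) ** 2


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
stats = subsample_stats(data, b=40, S=200, stat=toy_stat, rng=rng)
print(empirical_cdf(stats, 1.0))
```

Drawing $S_T$ subsamples at random rather than enumerating all $\binom{T}{b_T}$ of them is what the intermediate quantity in (E.59) and the Glivenko-Cantelli step in (E.60) account for.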
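It is implied by Hoeffding's inequality for U-statistics (Theorem A on page 201 of Serfling (1980a)) that for any real sequence $\{x_T\}$ and $\varepsilon > 0$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F\left(\tilde{L}_{T,b_T}(x_T; \gamma) - \Pr\nolimits_F\left(T_{T,b_T}^l(\gamma) \le x_T\right) > \varepsilon\right) = 0. \tag{E.61} \]
Equations (E.60) and (E.61) imply that, for any real sequence $\{x_T\}$ and $\varepsilon > 0$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}\left(\hat{L}_{T,b_T}(x_T; \gamma) - \Pr\nolimits_F\left(T_{T,b_T}^l(\gamma) \le x_T\right) > \varepsilon\right) = 0. \tag{E.62} \]
Applying Theorem E.1 to the subsample statistic $T_{T,b_T}^l(\gamma)$, we have for any $\varepsilon > 0$ and any real sequence $\{x_T\}$,
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr\nolimits_F\left(T_{T,b_T}^l(\gamma) \le x_T - \varepsilon\right) - \Pr\left(T_{b_T}^{appr}(\gamma) \le x_T\right)\right] \le 0. \tag{E.63} \]
Equations (E.62) and (E.63) imply that for any real sequence $\{x_T\}$,
\[ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}\left(\hat{L}_{T,b_T}(x_T; \gamma) > \eta_2 + \Pr\left(T_{b_T}^{appr}(\gamma) \le x_T + \eta_2\right)\right) \to 0. \tag{E.64} \]
Plugging $x_T = q_{b_T}^{appr}(\gamma, p) - 2\eta_2$ into the above equation, we have
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}\left(\hat{L}_{T,b_T}(q_{b_T}^{appr}(\gamma, p) - 2\eta_2; \gamma) > \eta_2 + p\right) = 0. \tag{E.65} \]
However, by the definition of $c_T^{sub}(\gamma, p)$, $\hat{L}_{T,b_T}(c_T^{sub}(\gamma, p) - \eta^*; \gamma) \ge p + \eta^* > \eta_2 + p$. Therefore
\[ \limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_{F,sub}\left(\hat{L}_{T,b_T}(q_{b_T}^{appr}(\gamma, p) - 2\eta_2; \gamma) \ge \hat{L}_{T,b_T}(c_T^{sub}(\gamma, p) - \eta^*; \gamma)\right) = 0, \tag{E.66} \]
which implies (E.56).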
(b) Let $q^{bt}_{\kappa_T}(\gamma,p)$ be the $p$ quantile of $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ conditional on the original sample. Below we
show that for $\eta_2 = \eta^{*}/3$,
\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,W}\left(c^{bt}_{T}(\gamma,p) < q^{bt}_{\kappa_T}(\gamma,p) + \eta_2\right) = 0, \tag{E.67}
\]
where $\Pr_{F,W}$ signifies that there are two sources of randomness in $c^{bt}_{T}(\gamma,p)$: the
original sampling and the bootstrap sampling. Once (E.67) is established, we have
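\[
\begin{aligned}
\liminf_{T\to\infty}\ \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,W}\left(T_T(\gamma) \le c^{bt}_{T}(\gamma,p)\right)
&\ge \liminf_{T\to\infty}\ \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F}\left(T_T(\gamma) \le q^{bt}_{\kappa_T}(\gamma,p) + \eta_2\right) \\
&\ge \liminf_{T\to\infty}\ \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr\left(T^{\mathrm{appr}}_{T}(\gamma) \le q^{bt}_{\kappa_T}(\gamma,p)\right) \\
&\ge \liminf_{T\to\infty}\ \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr\left(T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le q^{bt}_{\kappa_T}(\gamma,p)\right) \\
&= p,
\end{aligned} \tag{E.68}
\]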
where the first inequality holds by (E.67), the second inequality holds by Theorem E.1, and the third
inequality holds because $T^{\mathrm{appr}}_{\kappa_T}(\gamma) \ge T^{\mathrm{appr}}_{T}(\gamma)$ for any $\gamma$ and $T$, which in turn holds because $\sqrt{T} \ge \sqrt{\kappa_T}$ and
$\Lambda_{\kappa_T}(\theta,\gamma) \subseteq \Lambda_{T}(\theta,\gamma)$ for large enough $T$ by Assumptions D.1(c) and D.7(c).
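Now we show (E.67). First, we show that the c.d.f. of $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ is close to the following empirical
distribution:
\[
F_{S_T}(x,\gamma) = S_T^{-1}\sum_{l=1}^{S_T} 1\left\{T^{*}_{T,l}(\gamma) \le x\right\}, \tag{E.69}
\]
where $T^{*}_{T,1}(\gamma), \ldots, T^{*}_{T,S_T}(\gamma)$ are the $S_T$ conditionally independent copies of the bootstrap test
statistic. By the uniform Glivenko-Cantelli theorem, $F_{S_T}(x,\gamma)$ is close to the conditional c.d.f. of
$T^{*}_{T}(\gamma)$: for any $\eta > 0$,
\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,W}\left(\sup_{x\in\mathbb{R}}\left|F_{S_T}(x,\gamma) - \Pr_{W}\left(T^{*}_{T}(\gamma) \le x\right)\right| > \eta\right) = 0. \tag{E.70}
\]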
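To make the role of the empirical distribution in (E.69) concrete, here is a minimal Python sketch of how a critical value with the property $F_{S_T}(c - \eta^{*}, \gamma) \ge p + \eta^{*}$ could be computed from bootstrap draws. The exact definition of $c^{bt}_{T}(\gamma,p)$ is given elsewhere in the paper; this sketch simply assumes it is the smallest draw whose empirical c.d.f. value reaches $p + \eta^{*}$, shifted up by $\eta^{*}$. The chi-square draws are placeholder data, and the function names are hypothetical.

```python
import numpy as np

def ecdf_at(draws, x):
    # Empirical c.d.f. of the bootstrap draws at a single point x.
    return np.mean(draws <= x)

def enlarged_critical_value(draws, p, eta_star):
    # ASSUMED construction: the (p + eta_star) empirical quantile, plus eta_star.
    s = np.sort(draws)
    # Smallest number of draws k with k / S >= p + eta_star, clipped to [1, S].
    k = int(np.ceil((p + eta_star) * len(s)))
    k = min(max(k, 1), len(s))
    q = s[k - 1]
    return q, q + eta_star

rng = np.random.default_rng(1)
draws = rng.chisquare(df=1, size=1000)   # placeholder for S_T bootstrap statistics
p, eta_star = 0.95, 0.001

q, c = enlarged_critical_value(draws, p, eta_star)
print(f"quantile = {q:.4f}, critical value = {c:.4f}")
```

Under this assumed construction, evaluating the empirical c.d.f. at the critical value minus `eta_star` (that is, at `q`) gives at least `p + eta_star`, which is the only property of the critical value that the argument uses.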
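The same arguments as those for Theorem E.1 can be followed to show that $T^{*}_{T}(\gamma)$ is close in
law to $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ in the following sense: for any real sequence $x_T$,
\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F}\left(\left[\Pr_{W}\left(T^{*}_{T}(\gamma) \le x_T - \eta_2\right) - \Pr\left(T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le x_T\right)\right] \ge \eta_2\right) = 0. \tag{E.71}
\]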
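When following the arguments for Theorem E.1, we simply need to observe the resemblance between
$T_T(\gamma)$ and $T^{*}_{T}(\gamma)$ in the following form:
\[
T^{*}_{T}(\gamma) = \min_{\theta\in\Theta_{0,F}(\gamma)}\ \min_{\lambda\in\Lambda_{\kappa_T}(\theta,\gamma)} \int_{\mathcal{G}} S\left(\nu^{*+}_{T}(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{\kappa_T}\,\rho_F(\theta, g),\ \Sigma_n(\theta + \lambda/\sqrt{T}, g)\right) d\mu(g), \tag{E.72}
\]
where
\[
\nu^{*+}_{T}(\theta, g) = \nu^{*}_{T}(\theta, g) + \kappa_T^{1/2} n^{-1/2} \nu_T(\theta, g), \tag{E.73}
\]
and then use Lemma E.2 in conjunction with Assumption D.2(c), and Lemma E.1(b) in place of
Lemma E.1(a).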
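Equations (E.70) and (E.71) together imply that, for any real sequence $x_T$,
\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,W}\left(\left[F_{S_T}(x_T - \eta_2, \gamma) - \Pr\left(T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le x_T\right)\right] \ge 2\eta_2\right) = 0. \tag{E.74}
\]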
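Plugging in $x_T = q^{\mathrm{appr}}_{\kappa_T}(\gamma,p) - \eta_2$, we have
\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,W}\left(F_{S_T}\left(q^{\mathrm{appr}}_{\kappa_T}(\gamma,p) - 2\eta_2, \gamma\right) \ge p + 2\eta_2\right) = 0. \tag{E.75}
\]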
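But by definition, $F_{S_T}(c^{bt}_{T}(\gamma,p) - \eta^{*}, \gamma) \ge p + \eta^{*} > p + 2\eta_2$. Therefore,
\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,W}\left(F_{S_T}\left(q^{\mathrm{appr}}_{\kappa_T}(\gamma,p) - 2\eta_2, \gamma\right) \ge F_{S_T}\left(c^{bt}_{T}(\gamma,p) - \eta^{*}, \gamma\right)\right) = 0, \tag{E.76}
\]
which implies (E.67).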
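The last implication can be spelled out as follows. This is a sketch of the omitted step, writing $q$ for the $p$ quantile of $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ and $c$ for $c^{bt}_{T}(\gamma,p)$ as shorthand, and using only $\eta^{*} = 3\eta_2$ and the fact that an empirical distribution function is nondecreasing.

```latex
\begin{align*}
c < q + \eta_2
  &\;\Longrightarrow\; c - \eta^{*} < q + \eta_2 - 3\eta_2 = q - 2\eta_2 \\
  &\;\Longrightarrow\; F_{S_T}(c - \eta^{*}, \gamma) \le F_{S_T}(q - 2\eta_2, \gamma)
     \qquad \text{(monotonicity of } F_{S_T}(\cdot,\gamma)\text{)}, \\
\text{so}\quad
\Pr\nolimits_{F,W}\bigl(c < q + \eta_2\bigr)
  &\le \Pr\nolimits_{F,W}\bigl(F_{S_T}(q - 2\eta_2, \gamma) \ge F_{S_T}(c - \eta^{*}, \gamma)\bigr).
\end{align*}
```

Taking the supremum over $(\gamma,F)\in\mathcal{H}_0$ and then the lim sup on both sides, the right-hand side vanishes by (E.76), which gives (E.67).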
References
Andrews, D. W. K., and X. Shi (2013): “Inference Based on Conditional Moment Inequality
Models,” Econometrica, 81.
Bugni, F. A., I. A. Canay, and X. Shi (2016): “Inference for Subvectors and Other Functions
of Partially Identified Parameters in Moment Inequality Models,” Quantitative Economics.
Chernozhukov, V., H. Hong, and E. Tamer (2007): “Estimation and Confidence Regions for
Parameter Sets in Econometric Models,” Econometrica, 75, 1243–1284.
Romano, J., and A. Shaikh (2008): “Inference for Identifiable Parameters in Partially Identified
Econometric Models,” Journal of Statistical Planning and Inference, (Special Issue in Honor of
T. W. Anderson, Jr. on the Occasion of his 90th Birthday), 138, 2786–2807.
Serfling, R. J. (1980a): Approximation Theorems in Mathematical Statistics. John Wiley and
Sons, Inc.
van der Vaart, A., and J. Wellner (1996): Weak Convergence and Empirical Processes: with
Applications to Statistics. Springer.