
Estimating Demand for Differentiated Products with Zeroes in

Market Share Data∗

Amit Gandhi

UW-Madison

Microsoft

Zhentong Lu

SUFE

Xiaoxia Shi †

UW-Madison

April 18, 2017

Abstract

In this paper we introduce a new approach to estimating differentiated product demand

systems that allows for products with zero sales in the data. Zeroes in demand are a common

problem in product differentiated markets, but fall outside the scope of existing demand es-

timation techniques. Our solution to the zeroes problem is based on constructing bounds for

the conditional expectation of the inverse demand. These bounds can be translated into moment

inequalities that are shown to yield a consistent and asymptotically normal point estimator

for demand parameters under natural conditions for differentiated product markets. In Monte

Carlo simulations, we demonstrate that the new approach works well even when the fraction of

zeroes is as high as 95%. We apply our estimator to supermarket scanner data and find price

elasticities become on the order of twice as large when zeroes are properly controlled.

Keywords: Demand Estimation, Differentiated Products, Profile, Measurement Error, Mo-

ment Inequality.

JEL: C01, C12, L10, L81.

1 Introduction

In this paper we introduce a new approach to differentiated product demand estimation that allows

for zeroes in empirical market share data. Such zeroes are a highly prevalent feature of demand in

a variety of empirical settings, ranging from workhorse scanner retail data, to data as diverse as

∗Previous version of this paper was circulated under the title “Estimating Demand for Differentiated Products with Error in Market Shares.”

†We are thankful to Steven Berry, Jean-Pierre Dube, Philip Haile, Bruce Hansen, Ulrich Muller, Aviv Nevo, Jack Porter, and Chris Taber for insightful discussions and suggestions. We would also like to thank the participants at the MIT Econometrics of Demand Conference, Chicago-Booth Marketing Lunch, the Northwestern Conference on “Junior Festival on New Developments in Microeconometrics,” the Cowles Foundation Conference on “Structural Empirical Microeconomic Models,” 3rd Cornell - Penn State Econometrics & Industrial Organization Workshop, as well as seminar participants at Wisconsin-Madison, Wisconsin-Milwaukee, Cornell, Indiana, Princeton, NYU, Penn and the Federal Trade Commission for their many helpful comments and questions.


homicide rates and international trade flows (we discuss these examples in further depth below).

Zeroes naturally arise in “big data” applications which allow for increasingly granular views of

consumers, products, and markets (see for example Quan and Williams (2015), Nurski and Verboven

(2016)). Unfortunately, the standard estimation procedures following the seminal Berry, Levinsohn,

and Pakes (1995) (BLP for short) cannot be used in the presence of zero empirical shares - they

are simply not well defined when zeroes are present. Furthermore, ad hoc fixes to market zeroes

that are sometimes used in practice, such as dropping zeroes from the data or replacing them with

small positive numbers, are subject to biases which can be quite large (discussed further below).

This has left empirical work on demand for differentiated products without satisfying solutions to

the zero shares problem, which is the key void our paper aims to fill.

In this paper we provide an approach to estimating differentiated product demand models that

provides consistency (and asymptotic normality) for demand parameters despite a possibly large

presence of market zeroes in the data. We first isolate the econometric problem caused by zeroes

in the data. The problem, we show, is driven by the wedge between choice probabilities, which

are the theoretical outcome variables predicted by the demand model, and market shares, which

are the empirical revealed preference data used to estimate choice probabilities. Although choice

probabilities are strictly positive in the underlying model, market shares are often zero if choice

probabilities are small. The root of the zeroes problem is that substituting market shares (or

some other consistent estimate) for choice probabilities in the moment conditions that identify the

model, which is the basis for the traditional estimators, will generally lead to asymptotic bias.

While this bias is assumed away in the traditional approach, it cannot be avoided whenever zeroes

are prevalent in the data.

Our solution to this problem is to construct a set of moment inequalities for the model, which

are by design robust to the sampling error in market shares - our moment inequalities will hold at

the true value of the parameters regardless of the magnitude of the measurement error in market

shares for choice probabilities. Despite taking an inequality form, we use these moment inequalities

to form a GMM-type point estimator based on minimizing the deviations from the inequalities.

We show this estimator is consistent so long as there is a positive mass of observations whose

latent choice probabilities are bounded sufficiently away from zero, e.g., products for which market

shares are not likely to be zero. This is natural in many applications (as illustrated in Section 2),

and strictly generalizes the restrictions on choice probabilities for consistency under the traditional

approach. Asymptotic normality then follows by adapting arguments from censored regression

models by Khan and Tamer (2009).

Computationally, our estimator closely resembles the traditional approach with only a slight

adjustment in how the empirical moments are constructed. In particular it is no more burdensome

than the usual estimation procedures for BLP and can be implemented using either the standard

nested fixed point method of the original BLP, or the MPEC method as advocated more recently

by Dube, Fox, and Su (2012).

We investigate the finite sample performance of the approach in a variety of mixed logit examples.

We find that our estimator works well even when the fraction of zeroes is as high as

95%, while the standard procedure with the observations with zeroes deleted yields severely biased

estimators even with mild or moderate fractions of zeroes.

We apply our bounds approach to widely used scanner data from the Dominick’s Finer Foods

(DFF) retail chain. In particular, we estimate demand for the tuna category as previously studied

by Chevalier, Kashyap, and Rossi (2003) and continued by Nevo and Hatzitaskos (2006) in the

context of testing the loss leader hypothesis of retail sales. We find that controlling for products

with zero demand using our approach gives demand estimates that can be more than twice as

elastic as standard estimates that select out the zeroes. We also show that the estimated price

elasticities do not increase during Lent, which is a high demand period for this product category,

after we control for the zeroes. Both of these findings have implications for reconciling the loss-

leader hypothesis with the data.

The plan of the paper is the following. In Section 2, we illustrate the stylized empirical pattern

of Zipf’s law where market zeroes naturally arise. In Section 3, we describe our solution to the

zeroes problem using a simple logit setup without random coefficients to make the essential matters

transparent. In Section 4, we introduce our general approach for discrete choice models with random

coefficients. Sections 5 and 6 present results of Monte Carlo simulations and the application to the

DFF data, respectively. Section 7 concludes.

2 The Empirical Pattern of Market Zeroes

In this section we highlight some empirical patterns that arise in applications where the zero shares

problem arises, which will also help to motivate the general approach we take to it in the paper.

Here we will primarily use workhorse store level scanner data to illustrate these patterns. It is

this same data that will also be used for our empirical application. However we emphasize that

our focus here on scanner data is only for the sake of a concrete illustration of the market zeroes

problem - the key patterns we highlight in scanner data are also present in many other economic

settings where demand estimation techniques are used (discussed further below and illustrated in

the Appendix).

We employ here a widely studied store-level scanner data set from the Dominick’s Finer Foods

grocery chain, which is public data that has been used by many researchers.1 The data comprises

93 Dominick’s Finer Foods stores in the Chicago metropolitan area over the years from 1989 to

1997. Like other store level scanner data sets, this data set provides demand information (price,

sales, marketing) at a store/week/UPC level, where a UPC (universal product code) is a unique

1 For a complete list of papers using this data set, see the website of Dominick’s Database: http://research.chicagobooth.edu/marketing/databases/dominicks/index.aspx


bar code that identifies a product2.

Table 1 presents information on the resulting product variety across the different product categories

in the data. The first column shows the number of products in an average store/week - the

number of UPC’s can be seen varying from roughly 50 (e.g., bath tissue) to over four hundred

(e.g., soft drinks) within even these fairly narrowly defined categories. Thus there is considerable

product variety in the data. The next two columns illustrate an important aspect of this large

product variety: there are often just a few UPC’s that dominate each product category whereas

most UPC’s are not frequently chosen. The second column illustrates this pattern by showing the

well known “80/20” rule prevails in our data: we see that roughly 80 percent of the total quantity

purchased in each category is driven by the top 20 percent of the UPC’s in the category. In con-

trast to these “top sellers”, the other 80 percent of UPC’s contain relatively “sparse sellers” that

share the remaining 20 percent of the total volume in the category. The third column shows an

important consequence of this sparsity: many UPC’s in a given week at a store simply do not sell.

In particular, we see that the fraction of observations with zero sales can even be nearly 60% for

some categories.

Table 1: Selected Product Categories in the Dominick’s Database

Category               Average number of UPC’s    Percent of total sale     Percent of
                       in a store/week pair       of the top 20% UPC’s      zero sales

Beer                   179                        87.18%                    50.45%
Cereals                212                        72.08%                    27.14%
Crackers               112                        81.63%                    37.33%
Dish Detergent         115                        69.04%                    42.39%
Frozen Dinners         123                        66.53%                    38.32%
Frozen Juices          94                         75.16%                    23.54%
Laundry Detergents     200                        65.52%                    50.46%
Paper Towels           56                         83.56%                    48.27%
Refrigerated Juices    91                         83.18%                    27.83%
Soft Drinks            537                        91.21%                    38.54%
Snack Crackers         166                        76.39%                    34.53%
Soaps                  140                        77.26%                    44.39%
Toothbrushes           137                        73.69%                    58.63%
Canned Tuna            118                        82.74%                    35.34%
Bathroom Tissues       50                         84.06%                    28.14%

We can visualize this situation another way by fixing a product category (here we use canned

2 Store level scanner data can often be augmented with a panel of household level purchases (available, for example, through IRI or Nielsen). Although the DFF data do not contain this micro level data, the main points of our analysis are equally applicable to the case where household level data is available. In fact our general choice model will accommodate the possibility of micro data. Store level purchase data is actually a special case of household level data where all households are observationally identical (no observable individual level characteristics).


Figure 1: Zipf’s Law in Scanner Data

tuna) and simply plotting the histogram of the volume sold for each week/UPC realization for a

single store in the data. This frequency plot is given in Figure 1. As can be seen, there is a sharp

decay in the empirical frequency as the purchase quantity becomes larger, with a long thin tail. In

particular the bulk of UPC’s in the store have small purchase volume: the median UPC sells less

than 10 units a week, which is less than 1.5% of the median volume of Tuna the store sells in a

week. The mode of the frequency plot is a zero share.

This power-law decay in the frequency of product demand is often associated with “Zipf’s

law” or “the long tail”, which has a long history in empirical economics.3 We present further

illustrations of this long-tail demand pattern found in international trade flows as well as cross-

county homicide rates in Appendix A, which provides a sense of the generality of these stylized

facts.

The key takeaway from these illustrations is that the presence of market zeroes in the data is

closely intertwined with the prevalence of power-law patterns of demand. We will later exploit this

relationship to place structure on the data generating process that underlies market zeroes.

3 A First Pass Through Logit Demand

Why do zero shares create a problem for demand estimation? In this section, we use the workhorse

multinomial logit model to explain the zeroes problem and introduce our new estimation strategy.

Formal treatment for general differentiated product demand models is given in the next section.

3 See Anderson (2006) for a historical summary of Zipf’s law and many examples from the social and natural sciences. See Gabaix (1999a) for an application of Zipf’s law to the economics literature.


3.1 Zeroes Problem in the Logit Model

Consider a multinomial logit model for the demand of J products (j = 1, . . . , J) and an outside

option (j = 0). A consumer i derives utility uijt = δjt+ εijt from product j in market t, where δjt is

the mean-utility of product j in market t, and εijt is the idiosyncratic taste shock that follows the

type-I extreme value distribution. As is standard, the mean-utility δjt of product j > 0 is modeled

as

$$\delta_{jt} = X_{jt}'\beta + \xi_{jt}, \tag{3.1}$$

where Xjt is the vector of observable (product, market) characteristics, often including price, and

ξjt is the unobserved characteristic. The outside good j = 0 has mean utility normalized to δ0t = 0.

The parameter of interest is β.

Each consumer chooses the product that yields the highest utility. Aggregating consumers’

choices, we obtain the true choice probability of product j in market t, denoted as

πjt = Pr(product j is chosen in market t).

The standard approach introduced by Berry (1994) for estimating β is to combine demand system

inversion and instrumental variables.

First, for demand inversion, one uses the logit structure to find that

$$\delta_{jt} = \ln(\pi_{jt}) - \ln(\pi_{0t}), \quad \text{for } j = 1, \dots, J. \tag{3.2}$$

Then, to handle the potential endogeneity of Xjt (correlation with ξjt), one finds a random vector

zjt, such that

$$E[\xi_{jt} \mid z_{jt}] = 0. \tag{3.3}$$

Then two stage least squares with δjt defined in terms of choice probabilities as the dependent

variable becomes the identification strategy for β.

Unfortunately πjt is not observed as data - it is a theoretical choice probability defined by the

model but only indirectly revealed through actual consumer choices. The standard approach to this

following Berry (1994), Berry, Levinsohn, and Pakes (1995), and many subsequent papers in the

literature has been to substitute sjt, the empirical market share of product j in market t based on

the choices of n potential consumers, for πjt, and run a two-stage least square with ln (sjt)− ln (s0t)

as dependent variable, xjt as covariates, and zjt as instruments to obtain estimates for β.

Plugging in the estimate sjt for πjt appears innocuous at first glance because the number of

potential consumers (n) in a market from which sjt is constructed is typically large. Nevertheless

problems arise when there are (jt)’s for which πjt is very small. Because the slope of the natural

logarithm function approaches infinity when the argument approaches zero, even small estimation

error of πjt may lead to large error in the plugged-in version of δjt when πjt is very small. In particu-

lar, sjt may frequently equal zero in this case, causing the demand inversion to fail completely. The

first is the theoretical root of the small πjt problem, while the second is an unmistakable symptom.
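The failure of the plug-in inversion can be reproduced in a few lines. The following is a minimal sketch, not the authors' code: the market size, the mean utilities, and all variable names are illustrative assumptions. It draws multinomial purchases from strictly positive logit choice probabilities and applies (3.2) to the resulting empirical shares.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_choice_probs(delta):
    """Logit choice probabilities, outside good first, for mean utilities delta (length J)."""
    expd = np.exp(delta)
    denom = 1.0 + expd.sum()
    return np.concatenate(([1.0 / denom], expd / denom))

# Hypothetical market: three popular products and thirty niche products.
delta = np.concatenate((np.array([0.0, -0.5, -1.0]), np.full(30, -9.0)))
pi = logit_choice_probs(delta)      # true choice probabilities, all strictly positive
n = 2000                            # assumed number of potential consumers
counts = rng.multinomial(n, pi)     # purchases, outside good first
s = counts / n                      # empirical market shares

# Inversion (3.2) with the true probabilities recovers delta exactly.
delta_from_pi = np.log(pi[1:]) - np.log(pi[0])

# Plug-in inversion with empirical shares: log(0) = -inf for every zero share.
with np.errstate(divide="ignore"):
    delta_plugin = np.log(s[1:]) - np.log(s[0])

n_zero = int((s[1:] == 0).sum())
print(f"{n_zero} of {delta.size} inside products have a zero share")
```

Every product in this design has a positive choice probability, yet most niche products record no sales, and the plug-in mean utility is undefined for each of them.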


Data sets with this symptom are frequently encountered in empirical research as discussed in

Section 2. With such data, a common practice is to ignore the (jt)’s with sjt = 0, effectively

lumping those j’s into the outside option in market t. This leads however to a selection problem.

To see this, suppose sjt = 0 for some (j, t) and one drops these observations from the analysis

- effectively one is using a selected sample where the selection criterion is sjt > 0. In this selected

sample, the conditional mean of ξjt is no longer zero, i.e.,

$$E[\xi_{jt} \mid x_{jt}, s_{jt} > 0] \neq 0. \tag{3.4}$$

This is the well-known selection-on-unobservables problem and with such sample selection, an

attenuation bias ensues.4 The attenuation bias generally leads to demand estimates that appear

to be too inelastic.5

Another commonly adopted empirical “trick” is to add a small positive number ε > 0 to the

sjt’s that are zero, and use the resulting modified shares $s^{\varepsilon}_{jt} > 0$ in place of πjt.6 However, this

trick only treats the symptom, i.e., sjt = 0, but overlooks the nature of the problem: the true choice

probability πjt is small. In this case, small estimation error in any estimator $\hat\pi_{jt}$ of πjt would

lead to large error in the plugged-in version of δjt and in the estimate of β. This problem manifests

itself directly because the estimate $\hat\beta$ can be extremely sensitive to the particular choice of the small

number being added, with little guidance on what the “right” choice is. In general,

like selecting away the zeroes, the “adding a small number” trick also yields a biased estimator of β.

We illustrate both biases in the Monte Carlo section (Section 5).
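Both ad hoc fixes are easy to try on simulated data. The sketch below is an illustration under assumed values, not the design of Section 5: the panel dimensions, the true coefficients, and the helper names are hypothetical. The regressor is drawn exogenously so that plain OLS stands in for two stage least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
T, J, n = 200, 50, 500        # markets, products, consumers per market (illustrative)
beta0, beta1 = -6.0, 1.0      # hypothetical true intercept and slope

x = rng.normal(size=(T, J))
xi = 0.5 * rng.normal(size=(T, J))
delta = beta0 + beta1 * x + xi

expd = np.exp(delta)
denom = 1.0 + expd.sum(axis=1, keepdims=True)
pi = np.hstack((1.0 / denom, expd / denom))              # column 0 is the outside good
s = np.vstack([rng.multinomial(n, p) for p in pi]) / n   # empirical shares

def ols_slope(y, regressor):
    """Slope from an OLS regression of y on a constant and one regressor."""
    X = np.column_stack((np.ones_like(regressor), regressor))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

s_in, s_out = s[:, 1:], s[:, [0]]
zero = s_in == 0

# Fix 1: drop the zero-share observations.
log_out = np.log(np.broadcast_to(s_out, s_in.shape)[~zero])
b_drop = ols_slope(np.log(s_in[~zero]) - log_out, x[~zero])

# Fix 2: replace each zero share with a small number eps.
def slope_with_eps(eps):
    s_eps = np.where(zero, eps, s_in)
    y = np.log(s_eps) - np.log(s_out)
    return ols_slope(y.ravel(), x.ravel())

b_eps = {eps: slope_with_eps(eps) for eps in (1e-3, 1e-5, 1e-8)}

print("fraction of zero shares:", zero.mean())
print("slope after dropping zeroes:", b_drop)
print("slopes after adding eps:", b_eps)
```

In this design the slope from the selected sample falls below the true value, and the slope from the modified shares moves with the choice of eps, which is the sensitivity described above.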

Despite their failure as general solutions, these “ad hoc zero fixes” have in them what could be a

useful idea: perhaps the variation among the non-zero share observations can be used to estimate

the model parameters, while at the same time the presence of zeroes is controlled in such a way

that avoids bias. We now present a new estimator that formalizes this possibility by using moment

inequalities to control for the zeroes in the data while using the variation in the remaining part of

the data to consistently estimate the demand parameters. We continue in this section to illustrate

our approach within the logit model before treatment of the general case in the next section.

3.2 A Bounds Estimator

Our bounds approach turns the selection-on-unobservable problem into a selection-on-observable

strategy, with the key features that the selection is not based on market share but on exogenous vari-

4 In fact,

$$E[\xi_{jt} \mid x_{jt}, s_{jt} > 0] > 0 \tag{3.5}$$

in the homoskedastic case. This is because the criterion sjt > 0 selects high values of ξjt and leaves out low values of ξjt.

5 It is easy to see that the selection bias is of the same direction if the selection criterion is instead sjt > 0 for all t, as one is effectively doing when focusing on a few top sellers that never demonstrate zero sales in the data. The reason is that the event sjt > 0 for all t contains the event sjt > 0 for a particular t. If the markets are weakly dependent, the particular t part of the selection dominates.

6 Berry, Linton, and Pakes (2004) and Freyberger (2015) study the biasing effect of plugging in sjt for πjt. Their bias corrections do not apply when there are zeroes in the empirical shares.


ables, and is not determined ex-ante by the econometrician but rather automatically performed by

the estimator. Specifically, we assume that there exists a set of “safe product/markets” (j, t), identified

by the instrumental variable zjt, with inherently thick demand such that sjt has a small chance

of being zero. In particular, we assume a partition on the support of zjt: supp(zjt) = Z = Z0 ∪Z1

that separates the safe product/markets (zjt ∈ Z0) from the remaining “risky product/markets”

(zjt ∈ Z1). 7 The safe products have inherently desirable characteristics that often make them the

“top sellers” described in Section 2, while the risky products have less attractive characteristics that

often yield sparse demand. If we knew Z0 and focused on the observations such that zjt ∈ Z0, the

standard estimator would be consistent. The key challenge in the data is that the econometrician

will not know Z0 in advance. Our bounds estimator automatically utilizes the variation in Z0,

but at the same time safely controls for the observations in Z1, to consistently estimate β without

requiring the researcher either to know or to estimate the underlying partition (Z0,Z1).

Our approach first uses two mean-utility estimators, $\delta^u_{jt}$ and $\delta^\ell_{jt}$, that are functions of empirical

market shares (rather than the true choice probability), to form bounds on $E[\delta_{jt} \mid z_{jt}]$:

$$E[\delta^u_{jt} \mid z_{jt}] \ge E[\delta_{jt} \mid z_{jt}] \ge E[\delta^\ell_{jt} \mid z_{jt}], \quad \forall j, t \ \text{a.s.} \tag{3.6}$$

where δjt is the true mean-utility in (3.1). Next, the inequalities (3.6) combined with (3.3) imply

$$E[\delta^u_{jt} - x_{jt}'\beta \mid z_{jt}] \ge 0 \ge E[\delta^\ell_{jt} - x_{jt}'\beta \mid z_{jt}] \quad \text{a.s.} \tag{3.7}$$

Observe that the moment restriction (3.3) implies that

$$E[(\delta_{jt} - x_{jt}'\beta)\, g(z_{jt})] = 0 \quad \forall g \in \mathcal{G},$$

where G is a set of instrumental variable functions. Using instead our upper and lower mean utility

estimators in place of the true mean utility we have the following moment inequalities

$$E[(\delta^u_{jt} - x_{jt}'\beta)\, g(z_{jt})] \ge 0 \ge E[(\delta^\ell_{jt} - x_{jt}'\beta)\, g(z_{jt})] \quad \forall g \in \mathcal{G}. \tag{3.8}$$

Following Andrews and Shi (2013), we take each g ∈ G to be an indicator function for a hypercube

Bg ⊆ supp (z), i.e.,

g(zjt) = 1 (zjt ∈ Bg) ,

and as long as G is rich enough, identification information in (3.7) is preserved by the moment

inequalities (3.8).
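One way to build such a family of indicator functions is to tile the instrument space with cubes. The sketch below assumes the instruments have already been rescaled to the unit cube and uses cubes of side 1/(2r) for a single resolution r. The function name and this particular tiling are illustrative assumptions, not a transcription of the construction in Andrews and Shi (2013).

```python
import itertools
import numpy as np

def hypercube_indicators(z01, r):
    """Indicators g(z) = 1(z in B_g) for the (2r)^d cubes of side 1/(2r) tiling the unit cube.

    z01: (N, d) array of instruments rescaled to lie in [0, 1].
    Returns an (N, (2r)^d) array of zeros and ones, one column per cube.
    """
    z01 = np.asarray(z01, dtype=float)
    d = z01.shape[1]
    m = 2 * r
    # Cell index per coordinate; cell a covers ((a - 1)/m, a/m], and 0 is put in cell 1.
    cell = np.clip(np.ceil(z01 * m).astype(int), 1, m)
    cols = [np.all(cell == np.array(a), axis=1)
            for a in itertools.product(range(1, m + 1), repeat=d)]
    return np.column_stack(cols).astype(float)

# Usage: 100 observations of a two-dimensional instrument, 4 cells per axis.
z = np.random.default_rng(2).uniform(size=(100, 2))
G = hypercube_indicators(z, r=2)
```

Each row of G has exactly one nonzero entry, because the cubes at a given resolution partition the support. A richer family would stack the columns obtained for several values of r.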

7We will formalize the requirement on the partition in Section 4.


To form our estimator, define

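$$\rho^u_T(\beta, g) = (TJ)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J} (\delta^u_{jt} - x_{jt}'\beta)\, g(z_{jt}),$$

$$\rho^\ell_T(\beta, g) = (TJ)^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J} (x_{jt}'\beta - \delta^\ell_{jt})\, g(z_{jt}).$$

Let $[a]_-$ denote $|\min\{0, a\}|$. Our estimator is then

$$\hat\beta^{BD} = \arg\min_{\beta} \sum_{g \in \mathcal{G}} \mu(g) \left\{ [\rho^u_T(\beta, g)]_-^2 + [\rho^\ell_T(\beta, g)]_-^2 \right\}, \tag{3.9}$$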

where µ(g) is a probability density function on G, that is, $\mu(g) > 0$ for all $g \in \mathcal{G}$ and $\sum_{g \in \mathcal{G}} \mu(g) = 1$.

The function µ(g) is used to ensure summability of the terms, and the choice of µ(·) is discussed

in the next section.
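The sample criterion in (3.9) is short to code once the bounds and the instrument functions are stacked over (j, t). The following is a minimal sketch with hypothetical names and a finite set of instrument functions. It evaluates the criterion only, and the minimization could be handed to any numerical optimizer.

```python
import numpy as np

def neg_part(a):
    """[a]_- = |min{0, a}|, applied elementwise."""
    return np.abs(np.minimum(0.0, a))

def bounds_objective(beta, delta_u, delta_l, x, G, mu):
    """Sample criterion of (3.9) for a finite set of instrument functions.

    delta_u, delta_l: (N,) upper and lower mean-utility estimates stacked over (j, t).
    x: (N, k) regressors. G: (N, M) values g(z_jt), one column per g. mu: (M,) weights.
    """
    n_obs = x.shape[0]                       # equals T * J in a balanced panel
    xb = x @ beta
    rho_u = G.T @ (delta_u - xb) / n_obs     # rho^u_T(beta, g) for every g
    rho_l = G.T @ (xb - delta_l) / n_obs     # rho^l_T(beta, g) for every g
    return float(np.sum(mu * (neg_part(rho_u) ** 2 + neg_part(rho_l) ** 2)))
```

The criterion is zero at any β for which every sample moment inequality holds and positive as soon as one is violated, so only violations are penalized.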

Why is $\hat\beta^{BD}$ consistent? A heuristic proof is as follows. Let us define the partition G = G0 ∪ G1

where each g ∈ G0 has support inside Z0. This partition does not need to be explicitly formed

by the econometrician (only the flexible set of instrumental variable functions G over the entire

support of zjt in the observed data is needed as an input), but only needs to exist in the underlying

DGP. We can then separate the objective function underlying (3.9) into two additive pieces

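$$\sum_{g \in \mathcal{G}_0} \mu(g) \left\{ [\rho^u_T(\beta, g)]_-^2 + [\rho^\ell_T(\beta, g)]_-^2 \right\} + \sum_{g \in \mathcal{G}_1} \mu(g) \left\{ [\rho^u_T(\beta, g)]_-^2 + [\rho^\ell_T(\beta, g)]_-^2 \right\}. \tag{3.10}$$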

Notice that at the true parameter value β0, each of these sums in (3.10) converges in probability

to 0 because of the validity of the moment inequalities (3.8) at the true value β0. What happens

away from the true value, at some $\beta^* \neq \beta_0$? Observe that the second sum over G1 is by construction

nonnegative regardless of the value of β. The first sum, on the other hand, approaches

$$\sum_{g \in \mathcal{G}_0} \mu(g) \left( E[(\delta_{jt} - x_{jt}'\beta^*)\, g(z_{jt})] \right)^2,$$

that is, for each g ∈ G0 the square of the corresponding moment, because $\rho^u_T(\beta, g)$ and $\rho^\ell_T(\beta, g)$ converge as T → ∞ for g ∈ G0 (that is, for products whose zjt lies

in the safe set Z0). Then so long as the instruments $\{g(z_{jt})\}_{g \in \mathcal{G}_0}$ have sufficient variation for the IV

rank condition with xjt to hold (the standard logit identifying condition), we are ensured that for

at least a positive mass of g ∈ G0 we have that

$$E[(\delta_{jt} - x_{jt}'\beta^*)\, g(z_{jt})] \neq 0.$$

Thus the first sum in (3.10) will converge in probability to a strictly positive number. Hence

the limiting value of the objective function (3.9) attains a minimum at the true value β0 and thus


by standard arguments $\hat\beta^{BD} \to_p \beta_0$.
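The step from the penalized terms to a squared moment can be spelled out. This is a sketch that takes as given, as the heuristic argument does, that the bounds are asymptotically tight for $g \in \mathcal{G}_0$. Writing $a(g) = E[(\delta_{jt} - x_{jt}'\beta^*)\, g(z_{jt})]$, tightness gives

$$\rho^u_T(\beta^*, g) \to_p a(g), \qquad \rho^\ell_T(\beta^*, g) \to_p -a(g),$$

and for any real number $a$,

$$[a]_-^2 + [-a]_-^2 = a^2,$$

since exactly one of the two negative parts is nonzero when $a \neq 0$, and it equals $|a|$. Each term of the first sum in (3.10) therefore tends to $\mu(g)\, a(g)^2$, which is strictly positive whenever $a(g) \neq 0$.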

Figure 2 provides a graphical illustration of the above arguments. In the safe products region Z0,

the bounds are tight and provide identification power, while in Z1, the bounds may be uninformative

but still valid. So instrumental functions such as g1 ∈ G0 will form moment equalities that point

identify the model. Other instrumental functions, such as g2, g3 ∈ G1, are associated with slack

moment inequalities so they do not undermine the identification.

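Figure 2: Illustration of Bounds Approach

[Figure: the conditional means E[δ^u_jt | z_jt], E[δ_jt | z_jt], and E[δ^ℓ_jt | z_jt] plotted against z_jt. The horizontal axis is split into the safe region Z0 and the risky region Z1, with instrument functions g1 ∈ G0 and g2, g3 ∈ G1 marked.]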

The bounds estimator thus controls for the zeroes in the data while using the variation among

the safe products to consistently estimate the model parameters. We now generalize this logic and

formalize it to the general differentiated product demand context with general error distribution

for the random utility model. We will show both consistency and asymptotic normality of the

estimator in this general case.

4 The General Model and Estimator

The researcher has data on a sample of markets t = 1, . . . , T , and for each market t, there is a sample

of individuals i = 1, . . . , nt choosing from the j = 0, . . . , Jt products in the market. A product j in

market t is characterized by a vector of characteristics xjt ∈ Rdx that are observed to the researcher,

and a scalar unobserved product attribute ξjt. We will refer to the bundle (xjt, ξjt) as j’s product

characteristics (observed and unobserved). Note that to better match the feature of popular data

sets, we allow a t subscript for J , that is, different markets can have different numbers of products.

We will also allow a t subscript for n, the number of potential consumers.


In discrete choice models each consumer i = 1, . . . , nt in market t is assumed to make a single

choice from the product varieties j = 0, . . . , Jt in the market, where j = 0 denotes the outside

option of not purchasing. This choice is determined by maximizing a utility function that is random

from the perspective of the researcher. Specifically, the utility consumer i derives from consuming

product j in market t is given by

uijt = δjt + εijt,

where:

1. δjt is the mean-utility of product j in market t. Normalize δ0t = 0. As is standard, δjt is

modeled as

$$\delta_{jt} = x_{jt}'\beta + \xi_{jt}, \tag{4.1}$$

where xjt is the vector of observable (product, market) characteristics, often including price,

and ξjt is the scalar unobserved characteristic;

2. εijt is the idiosyncratic taste shock governed by the following distribution,

$$\varepsilon_{it} = (\varepsilon_{i0t}, \dots, \varepsilon_{iJ_t t}) \sim F(\cdot \mid x_t; \lambda), \tag{4.2}$$

where $x_t$ stands for $(x_{1t}', \dots, x_{J_t t}')'$, and $F(\cdot \mid x_t; \lambda)$ is a conditional cumulative distribution

function known up to the finite dimensional unknown parameter λ. Thus, the unknown

parameter in the model is θ = (β′, λ′)′. For clarity, we use θ0 ≡ (β′0, λ′0)′ to denote the true

value of the unknown parameter.

It is worth noting that allowing xt and the parameter λ to enter F makes this specification encom-

pass random coefficient specifications uijt = x′jtβi + ξjt, where βi follows some distribution (e.g.,

joint normal), because one can then view β as the mean of the random coefficients and εijt as the

sum of the products of the de-meaned random coefficients and the product characteristic xjt.8

We assume consumers demand the product that maximizes utility. Thus integrating out εit

yields a system of choice probabilities for agents in the market

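$$\sigma(\delta_t, x_t, \lambda) \equiv (\sigma_1(\delta_t, x_t, \lambda), \dots, \sigma_{J_t}(\delta_t, x_t, \lambda))',$$

where $\delta_t = (\delta_{1t}, \dots, \delta_{J_t t})'$. Then we obtain the demand system

$$\pi_t \equiv (\pi_{1t}, \dots, \pi_{J_t t})' = \sigma(\delta_t, x_t, \lambda), \tag{4.3}$$

where πjt = Pr(product j is chosen in market t) represents the true choice probability of product j in market t. Let $\sigma^{-1}(\pi_t, x_t, \lambda) \equiv (\sigma_1^{-1}(\pi_t, x_t, \lambda), \dots, \sigma_{J_t}^{-1}(\pi_t, x_t, \lambda))'$ denote the inverse demand function such that:

$$\delta_t = \sigma^{-1}(\pi_t, x_t, \lambda). \tag{4.4}$$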

8 Requiring F (·|xt, λ) to be known up to a finite dimensional parameter rules out the vertical model because for the vertical model, εit is a function of the unobservable product characteristics (quality).


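Note that in the simple logit model, $\sigma_j(\delta_t, x_t, \lambda)$ reduces to $\sigma_j(\delta_t) = \frac{\exp(\delta_{jt})}{1 + \sum_{j'=1}^{J_t} \exp(\delta_{j't})}$, and $\sigma_j^{-1}(\pi_t, x_t, \lambda)$ reduces to $\sigma_j^{-1}(\pi_t) = \ln(\pi_{jt}) - \ln(\pi_{0t})$.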
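Outside the simple logit case the inverse demand function has no closed form and is computed numerically. A common choice, assumed here for illustration and not specified at this point in the text, is the contraction mapping of Berry, Levinsohn, and Pakes (1995). The sketch below uses a single normally distributed random coefficient on a scalar characteristic, simulated with fixed draws. The function names and the specification are hypothetical.

```python
import numpy as np

def mixed_logit_shares(delta, x, lam, nu):
    """Simulated sigma(delta, x, lam) with utility delta_j + lam * nu_i * x_j + logit shock.

    delta, x: (J,) arrays; lam: scalar; nu: (R,) simulation draws of the taste shifter.
    """
    v = delta[None, :] + lam * nu[:, None] * x[None, :]          # (R, J) utilities
    ev = np.exp(v)
    return (ev / (1.0 + ev.sum(axis=1, keepdims=True))).mean(axis=0)

def invert_shares(pi, x, lam, nu, tol=1e-12, max_iter=10_000):
    """Solve sigma(delta, x, lam) = pi for delta by iterating the BLP contraction."""
    delta = np.log(pi) - np.log(1.0 - pi.sum())                  # logit inverse as start
    for _ in range(max_iter):
        new = delta + np.log(pi) - np.log(mixed_logit_shares(delta, x, lam, nu))
        if np.max(np.abs(new - delta)) < tol:
            return new
        delta = new
    raise RuntimeError("contraction did not converge")

# Round trip on hypothetical values: compute shares, then recover the mean utilities.
rng = np.random.default_rng(3)
J = 5
x = rng.uniform(0.0, 2.0, size=J)
nu = rng.standard_normal(200)
delta_true = rng.normal(-1.0, 0.5, size=J)
pi = mixed_logit_shares(delta_true, x, 0.5, nu)
delta_hat = invert_shares(pi, x, 0.5, nu)
```

With lam = 0 the first iterate already equals ln(πjt) − ln(π0t), which is the logit special case stated above. The inversion takes logarithms of its share argument, so it cannot be evaluated at a share of exactly zero.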

Inverting the demand system allows for the use of instrumental variables to identify θ. In

particular, instruments for the model are a random vector zjt that satisfies

E [ξjt |zjt ] = 0. (4.5)

Combining (4.4) and (4.5), the model yields the following moment restriction:

$$E[\sigma_j^{-1}(\pi_t, x_t, \lambda) - x_{jt}'\beta \mid z_{jt}] = 0. \tag{4.6}$$

If πt is observed, identification can be stated as follows. The model is identified if and only if for

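any $\theta = (\beta, \lambda) \neq \theta_0$,

$$\Pr\nolimits_F\left( m^*_F(\theta, z_{jt}) \neq 0 \ \text{and} \ z_{jt} \in \mathcal{Z} = \mathrm{supp}(z_{jt}) \right) > 0,$$

where

$$m^*_F(\theta, z_{jt}) = E[\sigma_j^{-1}(\pi_t, x_t; \lambda) - x_{jt}'\beta \mid z_{jt}]. \tag{4.7}$$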

Primitive conditions for identification are given in Berry and Haile (2014).

4.1 Bounds Estimator in the General Case

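As in the logit case, we construct a pair of inverse demand functions, $\delta^u_{jt}(\lambda)$ and $\delta^\ell_{jt}(\lambda)$, to form bounds on $E[\sigma_j^{-1}(\pi_t, x_t, \lambda) \mid z_{jt}]$, i.e.,

$$E[\delta^u_{jt}(\lambda) \mid z_{jt}] \ge E[\sigma_j^{-1}(\pi_t, x_t, \lambda) \mid z_{jt}] \ge E[\delta^\ell_{jt}(\lambda) \mid z_{jt}], \quad \text{a.s.} \tag{4.8}$$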

These inequalities combined with (4.5) form the moment inequalities that our estimation of θ is

based upon:

$$E[\delta^u_{jt}(\lambda) - x_{jt}'\beta \mid z_{jt}] \ge 0 \ge E[\delta^\ell_{jt}(\lambda) - x_{jt}'\beta \mid z_{jt}], \quad \text{a.s.} \tag{4.9}$$

To construct these upper and lower mean utility estimates $\delta^\ell_{jt}(\lambda)$, $\delta^u_{jt}(\lambda)$, we start by applying

the Laplace rule of succession to obtain an initial choice probability estimator that does not have

zeroes: $\tilde s_{jt} = \frac{n_t s_{jt} + 1}{n_t + J_t + 1}$.9 We call this the Laplace share estimator. It is a good estimator for the

choice probabilities when the prior information is only that these probabilities should be positive,

as argued in Jaynes (2003, Chap. 18), and thus provides a good starting point for our construction.

9 The Laplace rule of succession was proposed by Pierre-Simon Laplace in the early 19th century to predict the probability of an event happening given n independent past observations and the prior knowledge that the probability must be strictly between 0 and 1. It is a concept fundamental to modern probability theory despite being widely misunderstood and criticized. See Jaynes (2003, Chap. 18) for a thorough discussion.


We do not use the Laplace share estimator directly in place of πt, but use it to construct bounds

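$\delta^u_{jt}(\lambda)$ and $\delta^\ell_{jt}(\lambda)$. Specifically, writing $\tilde s_t$ for the vector of Laplace shares in market t, we define

$$\delta^u_{jt}(\lambda) = \Delta_{jt}(\tilde s_t, x_t; \lambda) + \log\left( \frac{\tilde s_{jt} + \eta_t}{\tilde s_{0t} - \eta_t} \right), \tag{4.10}$$

$$\delta^\ell_{jt}(\lambda) = \Delta_{jt}(\tilde s_t, x_t; \lambda) + \log\left( \frac{\tilde s_{jt} - \eta_t}{\tilde s_{0t} + \eta_t} \right), \tag{4.11}$$

where

$$\Delta_{jt}(\tilde s_t, x_t; \lambda) \equiv \sigma_j^{-1}(\tilde s_t, x_t; \lambda) - \log\left( \frac{\tilde s_{jt}}{\tilde s_{0t}} \right), \tag{4.12}$$

and ηt is a scalar in (0, 1/(nt + Jt + 1)).

It is instructive to consider the simple logit case, where $\sigma_j^{-1}(\tilde s_t, x_t; \lambda) = \log\left( \tilde s_{jt} / \tilde s_{0t} \right)$. The term $\Delta_{jt}(\tilde s_t, x_t; \lambda) = 0$, so the bounds boil down to

$$\delta^u_{jt} = \log\left( \frac{\tilde s_{jt} + \eta_t}{\tilde s_{0t} - \eta_t} \right) \quad \text{and} \quad \delta^\ell_{jt} = \log\left( \frac{\tilde s_{jt} - \eta_t}{\tilde s_{0t} + \eta_t} \right). \tag{4.13}$$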

Thus the tuning term ηt perturbs (both in a positive and negative direction) the Laplace share

for each product one at a time, and $\delta^u_{jt}$, $\delta^\ell_{jt}$ are then formed by applying the logit inversion to this

perturbed share.
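For the logit case, the Laplace shares and the two bounds in (4.13) take only a few lines to compute. The sketch below handles one market. The function names, the example purchase counts, and the default tuning value are illustrative assumptions.

```python
import numpy as np

def laplace_shares(counts):
    """Laplace shares (n*s_j + 1) / (n + J + 1) from purchase counts, outside good first."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()                 # number of potential consumers in the market
    J = counts.size - 1              # number of inside products
    return (counts + 1.0) / (n + J + 1.0)

def logit_bounds(counts, c=1e-3):
    """Lower and upper mean-utility bounds of (4.13) with eta = (1 - c) / (n + J + 1)."""
    counts = np.asarray(counts, dtype=float)
    s = laplace_shares(counts)
    eta = (1.0 - c) / (counts.sum() + counts.size)   # counts.size equals J + 1
    delta_u = np.log((s[1:] + eta) / (s[0] - eta))
    delta_l = np.log((s[1:] - eta) / (s[0] + eta))
    return delta_l, delta_u

# Hypothetical market: 1000 consumers, 900 choose the outside good, two products never sell.
counts = [900, 80, 20, 0, 0]
delta_l, delta_u = logit_bounds(counts)
width = delta_u - delta_l
```

Because every Laplace share is at least 1/(n + J + 1), which exceeds η, both bounds are finite even for products with no sales. In this example the interval for a zero-sales product is several log points wide, while the interval for the top seller is nearly degenerate.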

Remark 1. Observe that the distance between $\delta^u_{jt}$ and $\delta^\ell_{jt}$ is large when sjt is small (i.e., respecting the large error caused by the noise in st) and is negligible when sjt is large. Thus intuitively, for observations such that zjt ∈ Z0, which defines the safe products in a market, sjt is large with a high probability so that $E[\delta^u_{jt} \mid z_{jt}]$ and $E[\delta^\ell_{jt} \mid z_{jt}]$ closely resemble $E[\delta_{jt} \mid z_{jt}]$. On the other hand, the difference between $E[\delta^u_{jt} \mid z_{jt}]$ and $E[\delta^\ell_{jt} \mid z_{jt}]$ may be large for risky products (i.e., zjt ∈ Z1) because sjt has a high probability of being close to zero. This feature of the construction is key to the consistency result to be discussed later.

We now formally establish the validity of the bounds defined by (4.10) and (4.11).

Assumption 1. The conditional distribution of $(n_t s_{jt})_{j=0}^{J_t}$ given $(\pi_{jt}, x_{jt}, z_{jt})_{j=1}^{J_t}$ is multinomial with parameters $n_t$ and $(\pi_{jt})_{j=0}^{J_t}$.

Assumption 2. The inverse demand function $\sigma_j^{-1}(\cdot, x_t, \lambda)$ is well-defined and continuous on the probability simplex $\Delta_{J_t} \equiv \{(p_1, \dots, p_{J_t}) \in (0,1)^{J_t} : 1 - \sum_{j=1}^{J_t} p_j > 0\}$ for any $x_t$ and any $\lambda$.

Lemma 1. Suppose that Assumptions 1 and 2 hold. Then, there exists $\eta_t \in (0, 1/(n_t + J_t + 1))$ such that the inequalities in (4.8) hold at $\lambda = \lambda_0$ with $\delta^{u}_{jt}(\lambda)$ and $\delta^{\ell}_{jt}(\lambda)$ defined in (4.10) and (4.11).

Remark 2. The scalar $\eta_t$ is chosen to guarantee equation (4.8). The $\eta_t$ satisfying (4.8) may depend on $\pi_t$, $x_t$, and $n_t$, and thus may itself be a random variable, which makes it appear difficult to choose. However, we find that a rule of thumb works very well in both our Monte Carlo and empirical exercises. The rule is to start with, for example, $\eta_t = \frac{1 - 10^{-3}}{n_t + J_t + 1}$, then increase it to $\eta_t = \frac{1 - 10^{-4}}{n_t + J_t + 1}$, then to $\eta_t = \frac{1 - 10^{-5}}{n_t + J_t + 1}$, and so on, until the estimates stabilize. To see why this rule of thumb is reasonable, it is useful to note that if one choice, say $\eta^1_t$, satisfies (4.8), then another choice, say $\eta^2_t$, that lies between $\eta^1_t$ and $1/(n_t + J_t + 1)$ also satisfies (4.8). This is due to the monotonicity of the right-hand sides of (4.10) and (4.11) in $\eta_t$. On the other hand, using $\eta_t$'s that are closer to the boundary $1/(n_t + J_t + 1)$ will generally not hurt estimation precision much, because identification is based on the safe products, for which even the upper bound $1/(n_t + J_t + 1)$ is negligible relative to $s_{jt}$ with high probability. This suggests that we do not need to know the precise range of $\eta_t$'s that work, but can afford to make a conservative choice, as our rule of thumb does.
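The logit-case bounds and the rule of thumb for the tuning term can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names `laplace_shares` and `logit_bounds`, the parameter `iota` (so that the tuning term equals (1 - iota)/(n + J + 1)), and the example purchase counts are all hypothetical and not taken from the paper.

```python
import numpy as np

def laplace_shares(counts):
    """Laplace shares (n*s_j + 1) / (n + J + 1); index 0 is the outside good."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()            # number of consumers in the market
    J = counts.size - 1         # number of inside products
    return (counts + 1.0) / (n + J + 1.0)

def logit_bounds(counts, iota=1e-3):
    """Lower and upper bounds in the simple logit case.

    The tuning term is eta = (1 - iota) / (n + J + 1); iota is an
    illustrative name for the rule-of-thumb constant.
    """
    counts = np.asarray(counts, dtype=float)
    n, J = counts.sum(), counts.size - 1
    eta = (1.0 - iota) / (n + J + 1.0)
    s = laplace_shares(counts)
    upper = np.log((s[1:] + eta) / (s[0] - eta))
    lower = np.log((s[1:] - eta) / (s[0] + eta))
    return lower, upper

# Hypothetical market: outside good, a popular product, a niche product,
# and a product with zero sales.
counts = np.array([9000, 990, 10, 0])
for iota in (1e-3, 1e-4, 1e-5):
    lo, up = logit_bounds(counts, iota)
    print(iota, np.round(up - lo, 4))
```

In this sketch the gap between the bounds is tiny for the popular product and large for the zero-sales product, and it widens for every product as `iota` shrinks, which mirrors the monotonicity argument in the remark.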

In order to estimate $\theta$ based on the moment inequalities (4.9), we first transform the conditional moment inequalities into unconditional ones, following Andrews and Shi (2013), using a set $G$ of instrumental functions, where an instrumental function is a function of $z_{jt}$. The set $G$ that we use is given below, and it guarantees that (4.9) is equivalent to

$$E\left[(\delta^{u}_{jt}(\lambda) - x_{jt}'\beta)\, g(z_{jt})\right] \geq 0 \geq E\left[(\delta^{\ell}_{jt}(\lambda) - x_{jt}'\beta)\, g(z_{jt})\right] \quad \text{for all } g \in G. \tag{4.14}$$

Andrews and Shi (2013) discuss many different choices of $G$, including uncountable sets and countable sets. We only consider countable $G$ sets. Thus, given a data set of J products from T markets, we can construct a sample criterion function as:

$$Q_T(\theta) = \sum_{g \in G} \left[\rho^{u}_T(\theta, g)\right]_{-}^{2} \mu(g) + \sum_{g \in G} \left[\rho^{\ell}_T(\theta, g)\right]_{-}^{2} \mu(g), \tag{4.15}$$

where

$$\rho^{u}_T(\theta, g) = (T\bar{J})^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (\delta^{u}_{jt}(\lambda) - x_{jt}'\beta)\, g(z_{jt}),$$

$$\rho^{\ell}_T(\theta, g) = (T\bar{J})^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (x_{jt}'\beta - \delta^{\ell}_{jt}(\lambda))\, g(z_{jt}), \tag{4.16}$$

where $\bar{J} = T^{-1} \sum_{t=1}^{T} J_t$ is the average number of products in a market. The function $\mu(\cdot)$ is a probability distribution on $G$, which gives weights to each unconditional moment inequality. Our choice for $\mu(\cdot)$ is given below after the choice for $G$ is introduced.


Our bound estimator for $\theta = (\beta', \lambda')'$ is defined as¹⁰

$$\hat{\theta}^{BD}_T = \arg\min_{\theta} Q_T(\theta). \tag{4.17}$$

Numerically solving for $\hat{\theta}^{BD}_T$ is not much different from solving for the standard BLP estimator. As in the standard procedure, the criterion function is convex in $\beta$.¹¹ Thus, it is useful to separate the minimization problem into two steps:

$$\min_{\lambda} \min_{\beta} Q_T(\beta, \lambda). \tag{4.18}$$

The $\beta$ minimization can be solved efficiently and accurately even when many control variables are included in $x_{jt}$. The $\lambda$ minimization is typically a low-dimensional problem. One point worth noting is that the inverse demand functions involved in the quantities $\delta^{u}_{jt}(\lambda)$ and $\delta^{\ell}_{jt}(\lambda)$ can be solved by the same contraction mapping algorithm used in the standard BLP procedure. Alternatively, the optimization problem (4.18) can be formulated and solved as an MPEC problem using the machinery of Dube, Fox, and Su (2012).
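As a rough illustration of the criterion in (4.15)-(4.16), the sketch below evaluates it for a given coefficient vector once the bounds have been computed at a fixed value of the nonlinear parameter. It rests on several assumptions that are not stated in this section: the negative-part operator is taken to be min(a, 0), as is common in the moment inequality literature; the function name `bound_criterion` and its array layout (observations stacked over products and markets) are hypothetical.

```python
import numpy as np

def bound_criterion(beta, delta_u, delta_l, X, G, mu):
    """Sample criterion for stacked product/market observations.

    delta_u, delta_l : (N,) upper and lower bounds at a fixed lambda
    X                : (N, k) regressors x_jt
    G                : (N, m) instrumental functions g(z_jt), one column per g
    mu               : (m,) weights on the instrumental functions
    Assumption: [a]_- is taken to be min(a, 0).
    """
    n_obs = X.shape[0]                       # plays the role of T * J-bar
    xb = X @ beta
    rho_u = G.T @ (delta_u - xb) / n_obs     # should be >= 0 at the truth
    rho_l = G.T @ (xb - delta_l) / n_obs     # should be >= 0 at the truth
    neg_u = np.minimum(rho_u, 0.0)
    neg_l = np.minimum(rho_l, 0.0)
    return float(np.sum(neg_u ** 2 * mu) + np.sum(neg_l ** 2 * mu))

# Toy example: a constant regressor and a single instrumental function.
X = np.ones((4, 1))
G = np.ones((4, 1))
mu = np.array([1.0])
delta_u = np.full(4, 1.5)
delta_l = np.full(4, 0.5)
print(bound_criterion(np.array([1.0]), delta_u, delta_l, X, G, mu))  # inside the bounds
print(bound_criterion(np.array([3.0]), delta_u, delta_l, X, G, mu))  # violates the upper bound
```

For fixed bounds, each term is a squared negative part of a function that is linear in the coefficient vector, so the inner problem is convex; the outer search over the nonlinear parameter would wrap this function, recomputing the bounds at each trial value.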

Now we define the collection of instrumental functions $G$ and the weight $\mu(\cdot)$ on it that we use in the simulation and the empirical application of this paper.¹² For $G$, we divide the instrument vector $z_{jt}$ into discrete instruments, $z_{d,jt}$, and continuous instruments, $z_{c,jt}$. Let the set $Z_d$ be the discrete set of values that $z_{d,jt}$ can take. We normalize the continuous instruments to lie in $[0, 1]$ by replacing $z_{c,jt}$ with $F_{N(0,1)}(\Sigma_{z_c}^{-1/2} z_{c,jt})$, where $F_{N(0,1)}(\cdot)$ is the standard normal cdf and $\Sigma_{z_c}$ is the sample covariance matrix of $z_{c,jt}$. The set $G$ is defined as

$$G = \{ g_{a,r,\zeta}(z_d, z_c) = 1((z_c', z_d')' \in C_{a,r,\zeta}) : C_{a,r,\zeta} \in C \}, \text{ where}$$

$$C = \Big\{ \Big( \times_{u=1}^{d_{z_c}} \big( (a_u - 1)/(2r),\ a_u/(2r) \big] \Big) \times \{\zeta\} : a_u \in \{1, 2, \dots, 2r\} \text{ for } u = 1, \dots, d_{z_c},\ r = r_0, r_0 + 1, \dots, \text{ and } \zeta \in Z_d \Big\}. \tag{4.19}$$

In practice, we truncate $r$ at a finite value $r_T$. This does not affect the first-order asymptotic properties of our estimator as long as $r_T \to \infty$. For $\mu(\cdot)$, we use

$$\mu(g_{a,r,\zeta}) \propto (100 + r)^{-2} (2r)^{-d_{z_c}} K_d^{-1} \quad \text{for } g \in G_{d,cc}, \tag{4.20}$$

where $K_d$ is the number of elements in $Z_d$. The same $\mu$ measure is used and works well in Andrews and Shi (2013).
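The construction in (4.19)-(4.20) can be sketched as follows. This is a minimal illustration, assuming the continuous instruments have already been normalized to the unit interval; the function name `cube_instruments`, the argument names, and the choice to assign a value of exactly zero to the first cell are hypothetical conveniences, not part of the paper.

```python
import itertools
import numpy as np

def cube_instruments(zc, zd, r_min=1, r_max=2):
    """Indicator functions of hyper-cubes crossed with discrete cells.

    zc : (N, d) continuous instruments, assumed already mapped into [0, 1]
    zd : (N,) discrete instrument values
    Returns G (N, m), one column per function g_{a,r,zeta}, and normalized
    weights mu (m,) proportional to (100 + r)^-2 (2r)^-d / K_d.
    """
    zc = np.asarray(zc, dtype=float)
    zd = np.asarray(zd)
    d = zc.shape[1]
    levels = np.unique(zd)                  # the set Z_d, with K_d elements
    cols, weights = [], []
    for r in range(r_min, r_max + 1):
        edges = 2 * r
        # a_u in {1, ..., 2r}: index of the cell ((a_u - 1)/(2r), a_u/(2r)].
        # Clipping places a value of exactly 0 in the first cell (a convenience).
        cell = np.clip(np.ceil(zc * edges), 1, edges).astype(int)
        w = (100 + r) ** -2.0 * edges ** -float(d) / len(levels)
        for a in itertools.product(range(1, edges + 1), repeat=d):
            in_cube = np.all(cell == np.array(a), axis=1)
            for zeta in levels:
                cols.append((in_cube & (zd == zeta)).astype(float))
                weights.append(w)
    G = np.column_stack(cols)
    mu = np.array(weights)
    return G, mu / mu.sum()

G, mu = cube_instruments([[0.1], [0.6], [0.9]], [0, 0, 1])
print(G.shape, mu.sum())
```

The number of columns grows quickly with the dimension of the continuous instruments and with the truncation point, so in practice the truncation value and the set of continuous instruments would need to be chosen with the sample size in mind.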

10 When there is no partition of the space of $z_{jt}$ that separates out the safe products, the moment inequalities (4.9) partially identify $\theta$. In that case, the confidence set procedure in Andrews and Shi (2013), as well as the profiling approach in an early version of this paper, may be used for inference. However, in the current version of this paper, we focus on the point identification case, which is much more computationally tractable.

11 The convexity can be seen by examining the second-order derivative of $Q_T(\theta)$ with respect to $\beta$.

12 We note that appropriate choices of $G$ and $\mu$ are not unique. For other possible choices, see Andrews and Shi (2013).


4.2 Consistency and Asymptotic Normality

In the asymptotic framework, we let the number of markets T go to infinity, and let the number of consumers in each market, $n_t$, be a function of T that also goes to infinity as T does. The number of products $J_t$ may also be a function of T that goes to infinity with T, or it may stay finite.

The key concept behind our approach is the notion of safe products. We define the safe products according to the value that $z_{jt}$ takes. Let $Z_0$ be a subset of $R^{d_z}$, where $d_z$ is the dimension of $z_{jt}$. The product j is said to be a safe product in market t if $z_{jt} \in Z_0$. Thus, the instrumental variable not only induces exogenous variation in the explanatory variables, as in the standard setup, but also serves as an identifier of the safe products. The requirements on the set $Z_0$ are listed below.

If j is a safe product in market t, its market share $\pi_{jt}$ tends to be sufficiently different from zero, so that the slope of $\sigma_j^{-1}(\pi_t, x_t; \lambda)$ at the true choice probability $\pi_t$ tends not to be huge. As a result, the inverse demand function $\sigma_j^{-1}(\hat{\pi}_t, x_t; \lambda)$ should be sufficiently close to $\sigma_j^{-1}(\pi_t, x_t; \lambda)$ for a consistent estimator $\hat{\pi}_t$ of $\pi_t$. Thus, the first requirement is as follows.

Assumption 3. For any estimator $\hat{\pi}_t$ of $\pi_t$ such that $\sup_{j=0,\dots,J_t,\, t=1,\dots,T} |\hat{\pi}_{jt} - \pi_{jt}| \to_p 0$, we have

(a) $\sup_{t=1,\dots,T;\, j=1,\dots,J_t:\, z_{jt} \in Z_0} \sup_{\lambda} |\sigma_j^{-1}(\hat{\pi}_t, x_t; \lambda) - \sigma_j^{-1}(\pi_t, x_t; \lambda)| \to_p 0$.

(b) $\sup_{t=1,\dots,T;\, j=0,\dots,J_t:\, z_{jt} \in Z_0} |\ln \hat{\pi}_{jt} - \ln \pi_{jt}| \to_p 0$.

Remark 3. Assumption 3 is a strict generalization of the key consistency requirement for the

standard estimator as formalized by Assumption A.8 in Freyberger (2015) or Assumption A5 in

Berry, Linton, and Pakes (2004). Our Assumption 3 relaxes their approach by not placing any

restriction on the size of πjt for the risky products (zjt /∈ Z0). For those products, πjt can be

very small (asymptotically, it can approach zero very fast), and σ−1j (st, xt;λ) can be very different

from σ−1j (πt, xt;λ) (asymptotically, the two may not converge to each other), causing standard

estimators to fail.

In order to leverage the mass of safe product/market realizations in Z0 to achieve consistency,

we need to ensure that the variation in Z0 alone is enough to point identify θ0. This is the analogue

of the general identification condition (4.7) holding if we hypothetically selected the sample so that

zjt ∈ Z0.

Assumption 4. For any $\theta \neq \theta_0$, $\Pr_F(m^*_F(\theta, z_{jt}) \neq 0 \text{ and } z_{jt} \in Z_0) > 0$, where $m^*_F(\theta, z_{jt})$ is defined in (4.7).

Note that if the econometrician knew $Z_0$ ex ante, then it would be straightforward to implement the standard GMM estimator on the subsample with $z_{jt} \in Z_0$. But we do not know $Z_0$ ex ante. The main idea behind the design of our bound estimator is to automatically utilize the identification information in $Z_0$ while safely controlling for the presence of the risky mass $Z_1$, without requiring the researcher either to know or to estimate the partition ex ante.

Assumption 5 below is a regularity condition that guarantees that the identification in the assumption above can be achieved through the instrumental functions defined in (4.19) in the fashion of Andrews and Shi (2013). Part (a) is a moment condition that is stronger than needed for obtaining the Andrews and Shi (2013) type result, but the extra strength is used for the consistency of $\hat{\theta}^{BD}_T$ later.

Assumption 5. (a) $E_F[\sup_{\theta \in \Theta} |\sigma_j^{-1}(\pi_t, x_t; \lambda) - x_{jt}'\beta|\, 1\{z_{jt} \in Z_0\}] < \infty$.

(b) The set $Z_0$ is a countable disjoint union of elements in $C$, where $C$ is defined in (4.19).

Lemma 2. Under Assumptions 4 and 5, we have for any $\theta \neq \theta_0$, there exists a $g_{a,r,\zeta} \in G$ such that $C_{a,r,\zeta} \subseteq Z_0$ and

$$E_F[(\sigma_j^{-1}(\pi_t, x_t; \lambda) - x_{jt}'\beta)\, g_{a,r,\zeta}(z_{jt})] \neq 0. \tag{4.21}$$

A few more standard assumptions are also needed for consistency. These are given next. Let

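$$\rho^*_F(\theta, g) = E_F[(\sigma_j^{-1}(\pi_t, x_t; \lambda) - x_{jt}'\beta)\, g(z_{jt})]. \tag{4.22}$$

Assumption 6. (a) At any point $\lambda$, $\sigma_j^{-1}(\pi_t, x_t; \lambda)$ is continuous in $\lambda$ with probability one.

(b) $\sup_{\theta \in \Theta} \left| (T\bar{J})^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} \left( (\sigma_j^{-1}(\pi_t, x_t; \lambda) - x_{jt}'\beta)\, g(z_{jt}) - \rho^*_F(\theta, g) \right) \right| \to_p 0$ for all $g \in G_0$.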

Assumption 7 (a) below is the same as the analogous condition in Freyberger (2015) and (b) is

weaker than requiring Jt to be bounded.

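Assumption 7. (a) $\max_{j,t} |s_{jt} - \pi_{jt}| \to_p 0$ as $T \to \infty$.

(b) $\min_{t=1,\dots,T} n_t \to \infty$ and $\max_{t=1,\dots,T} J_t/n_t \to 0$.

Define:

$$\rho^{u}_{F,T}(\theta, g) = E_F[(\delta^{u}_{jt}(\lambda) - x_{jt}'\beta)\, g(z_{jt})],$$

$$\rho^{\ell}_{F,T}(\theta, g) = E_F[(x_{jt}'\beta - \delta^{\ell}_{jt}(\lambda))\, g(z_{jt})]. \tag{4.23}$$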

These functions have the T subscript because the $\eta_t$ that enters $\delta^{u}_{jt}(\lambda)$ and $\delta^{\ell}_{jt}(\lambda)$ depends on $n_t$ and $J_t$, which depend on T. Assumption 8 is a uniform law of large numbers type requirement, which is implied by some mild moment existence conditions if the markets are independent of each other.

Assumption 8. For $k = u, \ell$, the functions $\rho^{k}_{F,T}$ are well defined, and

$$\sup_{g \in G} \left| (T\bar{J})^{-1} \sum_{t=1}^{T} \sum_{j=1}^{J_t} \left( (\delta^{k}_{jt}(\lambda_0) - \beta_0' x_{jt})\, g(z_{jt}) - \rho^{k}_{F,T}(\theta_0, g) \right) \right| \to_p 0.$$

The following theorem shows the consistency of the bound estimator.

Theorem 1. Suppose that Assumptions 1-8 hold. Then

$$\|\hat{\theta}^{BD}_T - \theta_0\| \to_p 0. \tag{4.24}$$


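More assumptions are needed to derive the asymptotic normality of the bound estimator. These conditions are technical rather than illuminating. Thus, we relegate them to Appendix C.1.

Theorem 2. Suppose that Assumptions 1-8 and C.1-C.7 hold. Then $\sqrt{T\bar{J}}\,(\hat{\theta}^{BD}_T - \theta_0) \to_d N(0, \Gamma V \Gamma)$, where

$$\Gamma = \left[ \sum_{g \in G_0} \frac{\partial \rho^*_F(\theta_0, g)}{\partial \theta} \frac{\partial \rho^*_F(\theta_0, g)}{\partial \theta'} \mu(g) \right]^{-1}, \text{ and}$$

$$V = \sum_{g, g^* \in G_0} \Sigma(g, g^*) \frac{\partial \rho^*_F(\theta_0, g)}{\partial \theta} \frac{\partial \rho^*_F(\theta_0, g^*)}{\partial \theta'} \mu(g) \mu(g^*), \text{ with} \tag{4.25}$$

$G_0 = \{ g_{a,r,\zeta} \in G : \Pr((z_c', z_d')' \in C_{a,r,\zeta}) = \Pr((z_c', z_d')' \in C_{a,r,\zeta} \cap Z_0) \}$, and $\Sigma(g, g^*)$ being the limit of

$$\mathrm{Cov}\left( (T\bar{J})^{-1/2} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (\sigma_j^{-1}(\pi_t, x_t; \lambda) - \beta' x_{jt})\, g(z_{jt}),\ (T\bar{J})^{-1/2} \sum_{t=1}^{T} \sum_{j=1}^{J_t} (\sigma_j^{-1}(\pi_t, x_t; \lambda) - \beta' x_{jt})\, g^*(z_{jt}) \right).$$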

A standard bootstrap procedure can be used to estimate the standard deviation of the estimator

in practice and we shall discuss the implementation details of this procedure in the empirical section.

4.3 Partial Identification as an Alternative

The approach above provides a consistent point estimator based on an underlying set of moment inequalities. Point estimation relies on Assumptions 3 and 4, which allow for using variation among safe products for consistency. This is natural in many applications where the long tail pattern is present, and we illustrate its performance in the Monte Carlo below. Nevertheless, in settings where these assumptions are questionable, we can still use the underlying moment inequalities (4.14) as a basis for partial identification and inference.

The model (4.14) is a moment inequality model with many moment conditions. One can use the method developed in Andrews and Shi (2013) to construct a joint confidence set for the full vector $\theta_0$. This confidence set is constructed by inverting an Anderson-Rubin test: $CS = \{\theta : T(\theta) \leq c(\theta)\}$ for some test statistic $T(\theta)$ and critical value $c(\theta)$. Computing this set amounts to computing the 0-level set of the function $T(\theta) - c(\theta)$, where $c(\theta)$ is typically a simulated quantile and thus a non-smooth function of $\theta$. This is feasible if the dimension of $\theta_0$ is moderate, especially if one has access to parallel computing technology. If the dimension is high, however, the computational cost grows exponentially, and methods for this case have not been well developed.

On the other hand, in demand estimation, $\theta_0$ is high dimensional mainly because many control variables are included in $x_{jt}$. The coefficients of the control variables are nuisance parameters that often are of no particular interest. The typical parameters of interest are the price coefficient or the price elasticities, which are low-dimensional. Based on this observation, we propose a profiling method to profile out the nuisance parameters and construct confidence sets only for a parameter of interest. Since this part of the discussion is rather technical and tangential to our main contribution, we relegate it to Appendix D. Also, readers are referred to the early version of this paper (Gandhi, Lu, and Shi (2013)) for Monte Carlo simulations and empirical results using the profiling approach under partial identification.

5 Monte Carlo Simulations

In this section, we present two sets of Monte Carlo experiments with random coefficient logit models. The first experiment investigates the performance of our approach with moderate fractions of zero shares, which should cover most empirical scenarios. In the second experiment, we test our estimator with a data generating process that produces extremely large fractions of zeros; the purpose is to further illustrate the key idea of our estimator in exploiting the long tail pattern that is naturally present in the data.

Both experiments use a random coefficient logit model, where the utility of consumer i for product j in market t is

$$u_{ijt} = \alpha_0 + x_{jt}\beta_0 + \lambda_0 x_{jt} v_i + \xi_{jt} + \varepsilon_{ijt},$$

where $v_i \sim N(0, 1)$, $\lambda_0$ is the standard deviation of the random coefficient on $x_{jt}$, and the $\varepsilon_{ijt}$'s are i.i.d. across i, j, and t, following the Type I extreme value distribution. The parameters of interest are $\beta_0$ and $\lambda_0$, while $\alpha_0$ is a nuisance parameter. In both experiments, we set $\lambda_0 = .5$, $\beta_0 = 1$, and vary $\alpha_0$ for different designs. We simulate T markets, each with J products.

5.1 Moderately Many Zeroes

In the first experiment, the observed and unobserved characteristics are generated as $x_{jt} = \frac{j}{10} + N(0, 1)$ and $\xi_{jt} \sim N(0, .1^2)$ for each product j in market t. Thus one feature of the design is that $x_{jt}$ has some persistence across markets: products with a larger index tend to have a higher value of x (which respects the nature of the variation in the scanner data shown in Section 2). Finally, the vector of empirical shares in market t, $(s_{0t}, s_{1t}, \dots, s_{Jt})$, is generated from $\text{Multinomial}(n, [\pi_{0t}, \pi_{1t}, \dots, \pi_{Jt}]')/n$, where n represents the number of consumers in each market.¹³
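One market of this design could be simulated along the following lines. This is a sketch rather than the authors' code: the function name `simulate_market`, the default arguments, and the use of NumPy's random generator are assumptions, and the choice probabilities are approximated by averaging over 1000 draws of the consumer type, as described in the accompanying footnote.

```python
import numpy as np

def simulate_market(rng, J=50, n=10_000, alpha0=-9.0, beta0=1.0, lam0=0.5, v=None):
    """Simulate one market of the random coefficient logit design.

    Returns the characteristic x (J,), the simulated choice probabilities
    pi (J + 1,) with the outside good first, and the empirical shares.
    """
    if v is None:
        v = rng.standard_normal(1000)                 # consumer-type draws v_i
    x = np.arange(1, J + 1) / 10 + rng.standard_normal(J)
    xi = 0.1 * rng.standard_normal(J)                 # xi_jt ~ N(0, .1^2)
    # Mean utilities by consumer type (rows) and product (columns).
    u = alpha0 + beta0 * x + lam0 * np.outer(v, x) + xi
    expu = np.exp(u)
    pi_inside = (expu / (1.0 + expu.sum(axis=1, keepdims=True))).mean(axis=0)
    pi = np.concatenate(([1.0 - pi_inside.sum()], pi_inside))
    shares = rng.multinomial(n, pi) / n               # empirical shares s_t
    return x, pi, shares

rng = np.random.default_rng(0)
x, pi, shares = simulate_market(rng)
print("fraction of zero shares:", np.mean(shares[1:] == 0))
```

Lowering `alpha0` shrinks all inside-good choice probabilities and therefore raises the fraction of products with zero simulated sales, which is how the different designs vary the share of zeroes.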

With the simulated data set $\{(s_{jt}, x_{jt}) : j = 1, \dots, J\}_{t=1}^{T}$, we compute three estimators: our bound estimator (Bound); the standard BLP estimator using $s_t$ in place of $\pi_t$ and discarding observations with $s_{jt} = 0$ (ES); and the standard BLP estimator using the Laplace share $\tilde{s}_t$ (which contains no zeros) in place of $\pi_t$ (LS).

13 The $\pi_t$ has no closed form solution in the random coefficient model, and thus we compute it via simulation, i.e.,

$$\pi_{jt} = \frac{1}{s} \sum_{i=1}^{s} \frac{\exp(\alpha_0 + x_{jt}\beta_0 + \lambda_0 x_{jt} v_i + \xi_{jt})}{1 + \sum_{k=1}^{J} \exp(\alpha_0 + x_{kt}\beta_0 + \lambda_0 x_{kt} v_i + \xi_{kt})},$$

where $s = 1000$ is the number of consumer type draws ($v_i$).

All the estimators require simulating the market shares and solving demand systems for each trial of $\lambda$ in optimizing the objective function for estimation. We use the same set of random draws of $v_i$ as in the data generating process to eliminate simulation error, as it is not the focus of this paper. The BLP contraction mapping method is employed to numerically solve the demand systems.

We simulate 1000 datasets $\{(s^r_t, x^r_t) : t = 1, \dots, T\}_{r=1}^{1000}$ and implement all the estimators mentioned above on each for a repeated simulation study. For the instrumental functions, we use the countable hyper-cubes defined in (4.19), and set $r_T = 50$. We let $\eta = \frac{1 - \iota}{n + J + 1}$ with $\iota = 10^{-6}$ in constructing the bounds on the conditional expectation of the inverse demand function. Setting a smaller $\iota$, e.g., $10^{-10}$, gives virtually the same results as reported in the following tables. For the BLP estimator, we use $(1, x_{jt}, x_{jt}^2 - 1, x_{jt}^3 - 3x_{jt})$ (a constant and the first three Hermite polynomials) as instruments to construct the GMM objective function. Alternative transformations of $x_{jt}$ as instruments yield effectively the same results.

The bias and standard deviation of the estimators are presented in Table 2. As we can see from the table, the standard estimator with $s_t$ shows large bias for both $\beta$ and $\lambda$. Replacing the empirical share $s_t$ with the Laplace share $\tilde{s}_t$ (and thus not discarding the observations with $s_{jt} = 0$) reduces the bias for $\lambda$ but leaves a large bias for $\beta$, which even increases in DGPs I and IV. Our bound estimator is the least biased, and its bias is very small for both parameters, especially when the sample size (T) is larger.


Table 2: Monte Carlo Results: Random-Coefficient Logit Model

| DGP | T | Ave. % of Zeros | | ES β | ES λ | Bound β | Bound λ | LS β | LS λ |
|---|---|---|---|---|---|---|---|---|---|
| I | 25 | 9.53% | Bias | -.1936 | .3706 | -.0443 | .0436 | -.2380 | .2938 |
| | | | SD | .0185 | .0354 | .0348 | .0474 | .0189 | .0296 |
| | 50 | 9.46% | Bias | -.1940 | .3717 | -.0236 | .0195 | -.2353 | .2916 |
| | | | SD | .0150 | .0271 | .0294 | .0399 | .0146 | .0229 |
| | 100 | 9.48% | Bias | -.1939 | .3706 | -.0081 | .0018 | -.2347 | .2901 |
| | | | SD | .0126 | .0215 | .0235 | .0315 | .0118 | .0191 |
| II | 25 | 18.58% | Bias | -.6104 | .6730 | -.0329 | .0169 | -.4900 | .3994 |
| | | | SD | .0664 | .0841 | .0534 | .0525 | .0319 | .0388 |
| | 50 | 18.55% | Bias | -.6036 | .6648 | -.0040 | -.0069 | -.4867 | .3970 |
| | | | SD | .0528 | .0662 | .0403 | .0399 | .0242 | .0300 |
| | 100 | 18.53% | Bias | -.6018 | .6613 | .0037 | -.0120 | -.4865 | .3960 |
| | | | SD | .0394 | .0489 | .0299 | .0298 | .0199 | .0250 |
| III | 25 | 41.16% | Bias | -1.3199 | .7299 | .0253 | -.0344 | -1.0112 | .3830 |
| | | | SD | .3056 | .2201 | .0725 | .0487 | .0564 | .0476 |
| | 50 | 41.12% | Bias | -1.2937 | .7099 | .0263 | -.0299 | -1.0060 | .3794 |
| | | | SD | .2003 | .1418 | .0550 | .0375 | .0430 | .0367 |
| | 100 | 41.07% | Bias | -1.2903 | .7051 | .0112 | -.0171 | -1.0044 | .3762 |
| | | | SD | .1435 | .1028 | .0394 | .0282 | .0342 | .0282 |
| IV | 25 | 52.41% | Bias | -1.1039 | .4041 | .0453 | -.0461 | -1.1613 | .2857 |
| | | | SD | .2467 | .1381 | .0939 | .0549 | .0551 | .0416 |
| | 50 | 52.38% | Bias | -1.0969 | .3973 | .0260 | -.0297 | -1.1564 | .2829 |
| | | | SD | .1804 | .1017 | .0665 | .0415 | .0422 | .0318 |
| | 100 | 52.35% | Bias | -1.0901 | .3922 | .0104 | -.0175 | -1.1548 | .2805 |
| | | | SD | .1335 | .0761 | .0493 | .0327 | .0335 | .0246 |

Note: 1. J = 50, N = 10,000, β0 = 1, λ0 = .5, Number of Repetitions = 1000.

2. “ES”: Empirical Shares; “LS”: Laplace Shares.

3. DGP: I, II, III and IV correspond to α0 = −9, −10, −12 and −13, respectively.

5.2 Extremely Many Zeroes

Next we pressure test our bound estimator by pushing the fraction of zeroes in empirical shares toward the extreme. We modify the DGP slightly to produce a very high fraction of zeros. Specifically, we generate $x_{jt}$ from the following discrete distribution

| x | 1 | 12 | 15 |
|---|---|---|---|
| Pr(xjt = x) | .99 | .005 | .005 |

and

$$\xi_{jt} \sim 1(x_{jt} = 1) \times N(0, 2^2) + 1(x_{jt} \neq 1) \times N(0, .1^2).$$

All other aspects of the DGP are identical to the previous DGP.

The fractions of zeroes are made very high (82%-96%) by the choice of the $\alpha_0$ parameter. With such high fractions of zeroes, the vast majority of observations are uninformative. Thus, we need a larger sample size for any estimator to perform well. We consider T = 100, 200, 400. For simplicity of presentation and to reduce the computational burden, we here fix $\lambda$ at its true value and only investigate the behavior of the estimators for $\beta$.

The results are reported in Table 3, and they are very encouraging for the bound approach. The

ES estimator is severely biased toward 0, so is the LS estimator. The bound estimator is remarkably

accurate in these extreme cases. The performance highlights the key idea of identification behind

our estimator: utilizing the information in safe products with inherently thick demand to identify

the model while controlling the risky products with small/zero sales properly.

Table 3: Monte Carlo Results: Very Large Fraction of Zeros

| DGP | T | Ave. % of Zeros | | ES β | Bound β | LS β |
|---|---|---|---|---|---|---|
| I | 100 | 82.91% | Bias | -.3222 | -.0072 | -.2643 |
| | | | SD | .0272 | .0342 | .0240 |
| | 200 | 82.92% | Bias | -.3219 | -.0072 | -.2633 |
| | | | SD | .0142 | .0095 | .0041 |
| | 400 | 82.94% | Bias | -.3194 | -.0060 | -.2633 |
| | | | SD | .0267 | .0068 | .0031 |
| II | 100 | 89.59% | Bias | -.3777 | -.0059 | -.3311 |
| | | | SD | .0129 | .0133 | .0063 |
| | 200 | 89.57% | Bias | -.3777 | -.0066 | -.3308 |
| | | | SD | .0125 | .0095 | .0045 |
| | 400 | 89.55% | Bias | -.3759 | -.0060 | -.3308 |
| | | | SD | .0230 | .0066 | .0033 |
| III | 100 | 96.35% | Bias | -.5613 | -.0060 | -.5499 |
| | | | SD | .0090 | .0139 | .0090 |
| | 200 | 96.36% | Bias | -.5615 | -.0064 | -.5498 |
| | | | SD | .0069 | .0097 | .0064 |
| | 400 | 96.35% | Bias | -.5605 | -.0061 | -.5495 |
| | | | SD | .0102 | .0071 | .0046 |

Note: 1. T = 100, J = 50, N = 10,000, β0 = 1, λ0 = .5, Number of Repetitions = 1000.

2. We fix λ = λ0 (at the true value) without estimating it.

3. DGP: I, II, III correspond to α0 = −13, −14, −17.

6 Empirical Application

In this section, we apply our estimator on the same DFF scanner data previewed in Section 2. In

particular, we focus on the canned tuna category, as previously studied by Chevalier, Kashyap, and

Rossi (2003) (CKR for short) and Nevo and Hatzitaskos (2006) (NH for short). CKR observed

using the DFF data discussed in Section 2 that the share weighted price of tuna fell by 15 percent

during Lent (which we replicate below in our sample from the same data source), which is a high

demand period for this product. They attributed the outcome to loss-leading behavior on the part


of retailers. NH on the other hand suggest that this pricing pattern in the tuna data could instead

be explained by increased price sensitivity of consumers (consistent with an increase in search)

which causes a re-allocation of market shares towards less expensive products in the Lent period,

and hence a fall in the observed share weighted price index. They test this hypothesis directly in

the data by estimating demand parameters separately in the Lent and Non-Lent periods, and find

that demand becomes more elastic in the high demand (Lent) period.

Here we revisit the groundwork laid by NH to examine the difference in price elasticity between Lent and non-Lent periods. The main difference in our analysis is that we use data on all products, while NH restrict the sample to include only the top 30 UPCs and thus automatically drop products with small/zero sales. There are two main questions we seek to address: a) does the selection of UPCs with only positive shares significantly bias the estimates of price elasticity, and b) does the difference in price elasticities between the Lent and non-Lent periods persist after properly controlling for zeroes?

To make the comparison clear, we use largely the same specification of the model used in NH. In particular, we consider a logit specification

$$u_{ijt} = \alpha p_{jt} + \beta x_{jt} + \xi_{jt} + \varepsilon_{ijt},$$

where the control variables $x_{jt}$ consist of UPC fixed effects and a time trend.¹⁴ Thus the week-to-week variation in the product/market-level unobserved demand shock $\xi_{jt}$ largely captures short-term promotional efforts, e.g., in-store advertising and shelving choices, because the UPC fixed effects control for the intrinsic product quality, which is likely to be stable over a short time horizon. Because stores are likely to advertise or shelve the product more prominently during weeks when the product is on sale, we expect a negative correlation between price and the unobservable. We construct instruments for price by inverting DFF's data on gross margin to calculate the chain's wholesale costs, which is the standard price instrument in the literature that has studied the DFF data.¹⁵

We implement our bound estimator defined by (4.17) to obtain point estimates of $(\alpha, \beta)$ in the model. The 95% confidence intervals for the parameters are obtained using a standard bootstrap procedure.¹⁶

14 Empirical market shares are constructed using quantity sales and the number of people who visited the store that week (the customer count) as the relevant market size.

15 The gross margin is defined as (retail price - wholesale cost)/retail price, so we get wholesale cost using retail price × (1 - gross margin). The instrument is defensible in the store-disaggregated context we consider here because it has been shown that price sales in retail price primarily reflect a reduction in retailer margins rather than a reduction in marginal costs (see, e.g., Chevalier, Kashyap, and Rossi (2003) and Hosken and Reiffen (2004)). Thus sales (and hence promotions) are not being driven by the manufacturer through temporary reductions in marginal costs.

16 The procedure consists of the following steps: 1) draw with replacement a bootstrap sample of markets, denoted $\{t_1, \dots, t_T\}$; 2) compute the bound estimator $\hat{\theta}^{BD*}_T$ using the bootstrap sample; 3) repeat 1)-2) $B_T$ times to obtain $B_T$ independent (conditional on the original sample) copies of $\hat{\theta}^{BD*}_T$; 4) letting $q^*_T(\tau)$ be the $\tau$-th quantile of the $B_T$ copies of $(\hat{\theta}^{BD*}_T - \hat{\theta}^{BD}_T)$, the 95% bootstrap confidence interval is $[\hat{\theta}^{BD}_T - q^*_T(.975),\ \hat{\theta}^{BD}_T - q^*_T(.025)]$.
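The bootstrap steps in footnote 16 can be sketched generically as below. The function name `market_bootstrap_ci`, its arguments, and the toy estimator in the usage lines are hypothetical; any function that maps a list of market-level data sets to a parameter estimate could be passed in place of the bound estimator.

```python
import numpy as np

def market_bootstrap_ci(markets, estimator, n_boot=200, level=0.95, seed=0):
    """Bootstrap confidence interval that resamples whole markets.

    markets   : list with one entry per market (any per-market data object)
    estimator : callable mapping a list of markets to an estimate
                (scalar or 1-d array); a stand-in for the bound estimator
    """
    rng = np.random.default_rng(seed)
    T = len(markets)
    theta_hat = np.asarray(estimator(markets), dtype=float)
    draws = np.empty((n_boot,) + theta_hat.shape)
    for b in range(n_boot):
        idx = rng.integers(0, T, size=T)          # step 1: resample markets
        draws[b] = estimator([markets[i] for i in idx])   # step 2
    tail = (1.0 - level) / 2.0
    # step 4: quantiles of the centered bootstrap estimates
    q_lo, q_hi = np.quantile(draws - theta_hat, [tail, 1.0 - tail], axis=0)
    return theta_hat - q_hi, theta_hat - q_lo

# Toy usage: each "market" is a single number and the estimator is the mean.
toy_markets = list(np.linspace(0.0, 1.0, 50))
print(market_bootstrap_ci(toy_markets, lambda m: np.mean(m)))
```

Resampling entire markets, rather than individual product observations, keeps all products of a market together in each bootstrap sample, which matches the description of drawing markets with replacement.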


The estimation results are presented in Tables 4 and 5.¹⁷ Table 4 shows that the standard logit estimator that inverts empirical shares to recover mean utilities (and hence drops zeroes) has a significant selection bias towards zero. The UPC-level elasticities for the logit model are small in economic magnitude, with the average elasticity in the data being -.572. Furthermore, over 90% of products have inelastic demand. Using our bounds approach instead to control for zeroes has a major effect on the estimated elasticities. The average demand elasticity for UPCs becomes -1.362, and less than 35% of observations have inelastic demand. This change in the elasticities is consistent with the attenuation bias effects of dropping products with small/zero market shares.
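For reference, in a plain logit model with utility linear in price, the own-price elasticity of product j is the price coefficient times price times one minus the product's choice probability. The sketch below computes this quantity and the fraction of inelastic products; the function names and the numerical inputs are purely illustrative and are not taken from the DFF data or from the reported estimates.

```python
import numpy as np

def logit_own_price_elasticity(alpha, price, share):
    """Own-price elasticity alpha * p * (1 - s) in the plain logit model."""
    price = np.asarray(price, dtype=float)
    share = np.asarray(share, dtype=float)
    return alpha * price * (1.0 - share)

def fraction_inelastic(elasticities):
    """Share of observations whose elasticity is below one in absolute value."""
    return float(np.mean(np.abs(np.asarray(elasticities)) < 1.0))

# Hypothetical prices and shares for three products.
prices = np.array([0.8, 1.5, 2.4])
shares = np.array([0.02, 0.01, 0.001])
for alpha in (-0.4, -0.9):
    e = logit_own_price_elasticity(alpha, prices, shares)
    print(alpha, np.round(e, 3), fraction_inelastic(e))
```

Because shares at the UPC level are small, the elasticity is close to the price coefficient times the price, so a price coefficient that is attenuated toward zero translates almost proportionally into smaller estimated elasticities.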

Table 4: Demand Estimation Results

| | BLP | Bound |
|---|---|---|
| Price Coefficient | -.390 | -.910 |
| 95% CI | [-.40, -.38] | [-1.06, -.81] |
| Ave. Own Price Elasticity | -.572 | -1.362 |
| Fraction of Inelastic Products | 90.04% | 33.79% |
| No. of Obs. | 862,683 | 959,331 |

Table 5: Demand in Lent vs. Non-Lent

| | BLP: Lent | BLP: Non-Lent | Bound: Lent | Bound: Non-Lent |
|---|---|---|---|---|
| Price Coefficient | -.518 | -.371 | -.743 | -.911 |
| 95% CI | [-.55, -.48] | [-.38, -.36] | [-.84, -.45] | [-1.01, -.65] |
| Ave. Own Price Elasticity | -.757 | -.544 | -1.09 | -1.302 |
| Fraction of Inelastic Products | 84.02% | 92.84% | 43.65% | 35.00% |
| No. of Obs. | 70,496 | 792,187 | 78,838 | 880,493 |

Our second result is that we do not find evidence to suggest that demand becomes more elastic in the high demand period, as shown in Table 5. Using the standard logit estimator with zeroes dropped yields findings consistent with Nevo and Hatzitaskos (2006): demand appears more elastic in the high demand Lent period. In contrast, this effect disappears, and the sign of the difference marginally reverses, under our bounds estimator that controls for the zeroes. Thus we do not see evidence in our estimation of price elasticity being higher during the high demand period.

17 In principle we can estimate our model separately for each store, letting preferences change freely over stores depending on local preferences. These results are available upon request. Here we present the results of demand estimation pooling all stores together, as was done by Nevo and Hatzitaskos (2006). The store-level regression results are very similar to the pooled store regression, and the latter is a more concise summary of demand behavior.

This finding can be rationalized if the magnitude of the selection problem with dropping zeroes is different across the two periods. Such a change in the distribution of the unobservable $\xi_{jt}$ in the Lent period is indeed consistent with several features of the data. To see this, let us first recall the main reduced-form fact in the data, documented by Nevo and Hatzitaskos (2006), that suggested a change in price sensitivity in the Lent period. We replicate this reduced-form finding in Table 6, which shows that although the price index of tuna during Lent appears to be approximately 15 percent less expensive than in other weeks (as previously underscored by CKR), the average price of tuna is virtually unchanged between the Lent and non-Lent periods. Hence it is a re-allocation of demand towards less expensive products during Lent that drives the change in the aggregate price index.
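The distinction between a share-weighted price index and a simple average price can be made concrete with a small sketch. The function name `price_measures` and the numbers are hypothetical, and the exact index construction used in the regressions is not spelled out in this section, so this only illustrates the mechanism.

```python
import numpy as np

def price_measures(prices, quantities):
    """Return (quantity-share-weighted price index, unweighted average price)."""
    prices = np.asarray(prices, dtype=float)
    quantities = np.asarray(quantities, dtype=float)
    weights = quantities / quantities.sum()
    return float(weights @ prices), float(prices.mean())

# Same prices in both weeks; only the allocation of sales changes.
prices = [1.0, 2.0]
print(price_measures(prices, [10, 10]))   # regular week
print(price_measures(prices, [30, 10]))   # sales shift toward the cheaper product
```

In the second call the average price is unchanged while the weighted index falls, which is the pattern described above: the index can drop purely because demand re-allocates toward cheaper products.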

Table 6: Regression of Price Index on Lent

| | Price Index | Average Price |
|---|---|---|
| Lent | -.150 | -.009 |
| s.e. | (.0005) | (.0003) |

We take this decomposition one step further than NH, and examine the price index separately for products “on sale” and “regularly priced” during these periods.¹⁸ As can be seen in Table 7, it is the sales price index that is the key driver of the aggregate price index being cheaper during Lent. However, the average price of an “on-sale” product is not cheaper in the Lent period. This shows that it is a re-allocation towards more steeply discounted “on-sale” products during Lent that is driving this change in the aggregate price index. We do not see a corresponding reallocation for “regularly priced” products.

Table 7: Regression of Sales Price Index on Lent

| | Price Index: Sale | Price Index: Regular | Average Price: Sale | Average Price: Regular |
|---|---|---|---|---|
| Lent | -.199 | .035 | .010 | .001 |
| s.e. | (.0017) | (.0003) | (.0016) | (.0003) |

This suggests a tighter coordination of promotional effort and discounting in the high demand period. In effect, more steeply discounted products receive larger promotional effort on the part of the retailer during the high demand period, which is closer in spirit to the loss-leader hypothesis originally advanced for this data by CKR. Because promotional effort in the model is largely captured through the unobservable $\xi_{jt}$, this change in the behavior of the unobservable would also account for the selection effect due to dropping zeroes changing across the two periods. This hypothesis is also consistent with our estimated model: the correlation between $p_{jt}$ and $\xi_{jt}$ among products that are flagged as being on sale (having at least a 5% reduction from the highest price of the previous 3 weeks) strengthens from -.16 to -.24 between the non-Lent and Lent periods.

18 We flag an observation in the data as being on sale if that particular UPC in that particular store in that particular week has at least a 5% reduction from the highest price of the previous 3 weeks.
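The sale flag in footnote 18 could be computed for one UPC-store price series as in the sketch below. The function name `flag_sales`, the treatment of the first week (never flagged, since no past prices are available), and the example prices are assumptions made for illustration.

```python
def flag_sales(prices, window=3, discount=0.05):
    """Flag week t as a sale if its price is at least `discount` below the
    highest price observed over the previous `window` weeks.

    prices: weekly prices for one UPC in one store, in time order.
    Weeks with no earlier observations are not flagged (an assumption).
    """
    prices = list(prices)
    flags = []
    for t, p in enumerate(prices):
        past = prices[max(0, t - window):t]
        flags.append(bool(past) and p <= (1.0 - discount) * max(past))
    return flags

weekly_prices = [2.0, 2.0, 2.0, 1.8, 2.0, 1.95]
print(flag_sales(weekly_prices))
```

Only the fourth week is flagged in this example: its price is 10% below the recent maximum, while the final week's price is only 2.5% below it.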


7 Conclusion

We have shown that differentiated product demand models have enough content to construct a

system of moment inequalities that can be used to consistently estimate demand parameters despite

a possibly large presence of observations with zero market shares in the data. We construct a GMM-

type estimator based on these moment inequalities that is consistent and asymptotically normal

under assumptions that are a reasonable approximation to the DGP in many product differentiated

environments. Our application to scanner data reveals that taking the market zeroes in the data

into account has economically important implications for price elasticities.

A key message from our analysis is that it is critical not to ignore the zero shares when estimating discrete choice models with disaggregated market data. A potentially fruitful area for future research is the application of our approach to individual-level choice data, such as a household panel. Aggregating over households is still necessary to control for price endogeneity, as described by Berry, Levinsohn, and Pakes (2004) and Goolsbee and Petrin (2004), and thus zero market shares that arise when we aggregate over a limited sample of households in the data are a clear problem in many contexts. Nevertheless, the demographic richness in the household panel provides additional identifying power for random coefficients. The approach we describe can offer a novel solution to the joint problem of endogenous prices and flexible consumer heterogeneity with micro data, which we plan to pursue in future work.


A Further Illustrations of Zipf’s Law

In Figure 3 we illustrate this regularity using data from the two other applications that were

mentioned in Section 2: homicide rates and international trade flows. The left hand graph shows

the annual murder rate (per 10,000 people) for each county in the US from 1977-1992 (for details

about the data see Dezhbakhsh, Rubin, and Shepherd (2003)). The right hand side graph shows the

import trade flows (measured in millions of US dollars) among 160 countries that have a regional

trade agreement in the year 2006 (for details about the data see Head, Mayer, et al. (2013)). In each

of these two cases we see the characteristic pattern of Zipf’s law - a sharp decay in the frequency

for large outcomes and a large mass near zero (with a mode at zero in each case).

Figure 3: Zipf’s Law in Crime and Trade Data


B Proofs of Lemma 1 and Theorem 1

B.1 Proof of Lemma 1

Proof of Lemma 1. First consider the derivation:

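\[
\begin{aligned}
E\left[\ln\left(\frac{\tilde{s}_{jt}+\eta}{\tilde{s}_{0t}-\eta}\right)\middle|\,\pi_t,x_t\right]
&= E\left[\ln\left(\frac{n_t s_{jt}+1}{n_t+J_t+1}+\eta\right)\middle|\,\pi_t,x_t\right]
 - E\left[\ln\left(\frac{n_t s_{0t}+1}{n_t+J_t+1}-\eta\right)\middle|\,\pi_t,x_t\right] \\
&\ge \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - E\left[\ln\left(\frac{n_t s_{0t}+1}{n_t+J_t+1}-\eta\right)\middle|\,\pi_t,x_t\right] \\
&\ge \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - \ln\left(\frac{n_t+1}{n_t+J_t+1}-\eta\right)\Pr(n_t s_{0t}\ge 1\mid\pi_t)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)\Pr(n_t s_{0t}=0\mid\pi_t) \\
&\ge \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - \ln\left(\frac{n_t+1}{n_t+J_t+1}-\eta\right)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)(1-\pi_{0t})^{n_t} \\
&\ge \ln\left(\frac{1+\eta(n_t+J_t+1)}{n_t+1+\eta(n_t+J_t+1)}\right)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)(1-\pi_{0t})^{n_t},
\end{aligned}
\tag{B.1}
\]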

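where the first inequality holds because $n_t s_{jt} \ge 0$, the second inequality holds because $n_t s_{0t} \le n_t$, the third inequality holds by $\Pr(n_t s_{0t} \ge 1 \mid \pi_t) \le 1$ and Assumption 1, and the fourth inequality holds by combining the first two logarithms and enlarging the denominator. As $\eta$ approaches $1/(n_t+J_t+1)$ from below, the right-hand-side diverges to positive infinity. Therefore, for any finite $(\pi_t, x_t)$-measurable quantity, there exists an $\eta_t \in (0, 1/(n_t+J_t+1))$ such that $E\left[\ln\left(\frac{\tilde{s}_{jt}+\eta}{\tilde{s}_{0t}-\eta}\right)\middle|\,\pi_t,x_t\right]$ is greater than this quantity when $\eta = \eta_t$.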

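Next, define the $\varepsilon$-shrinkage of the $J$-dimensional simplex to be $\Delta^{\varepsilon}_{J} = \{(p_1,\dots,p_J) \in (0,1)^J : p_j \ge \varepsilon,\ 1-\sum_{j=1}^{J} p_j \ge \varepsilon\}$. By the definition of the Laplace share, $\Delta_{jt}(\tilde{s}_t, x_t, \lambda_0)$ lies in the interval

\[
\left[\ \min_{\pi\in\Delta^{1/(n_t+J_t+1)}_{J_t}} \Delta_{jt}(\pi,x_t,\lambda_0),\ \ \max_{\pi\in\Delta^{1/(n_t+J_t+1)}_{J_t}} \Delta_{jt}(\pi,x_t,\lambda_0)\ \right].
\tag{B.2}
\]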

The interval is well-defined and finite by Assumption 2. Similarly, δjt(λ0) is finite. Therefore, there

exists ηt such that

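\[
\begin{aligned}
E\left[\ln\left(\frac{\tilde{s}_{jt}+\eta_t}{\tilde{s}_{0t}-\eta_t}\right)\middle|\,\pi_t,x_t\right]
&\ge -\min_{\pi\in\Delta^{1/(n_t+J_t+1)}_{J_t}} \Delta_{jt}(\pi,x_t,\lambda_0) + \delta_{jt}(\lambda_0) \\
&\ge -E[\Delta_{jt}(\tilde{s}_t,x_t,\lambda_0)\mid\pi_t,x_t] + \delta_{jt}(\lambda_0).
\end{aligned}
\tag{B.3}
\]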

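This shows that $E[\delta^{u}_{jt}(\lambda_0)\mid\pi_t,x_t] \ge \delta_{jt}(\lambda_0)$, which implies that

\[
E[\delta^{u}_{jt}(\lambda_0)\mid z_{jt}] \ge E[\delta_{jt}(\lambda_0)\mid z_{jt}].
\tag{B.4}
\]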


This proves the upper bound part of (4.8). The lower bound part is analogous and thus omitted.

B.2 Proof of Theorem 1

Next, we prove Theorem 1. To do so, we present three lemmas first. Proofs of these lemmas are

presented after that of Theorem 1. Consider the subset of the instrumental function collection:

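\[
\mathcal{G}_0 = \{g_{a,r,\zeta}\in\mathcal{G} : \Pr((z_c', z_d)'\in C_{a,r,\zeta}) = \Pr((z_c', z_d)'\in C_{a,r,\zeta}\cap\mathcal{Z}_0)\}.
\tag{B.5}
\]

Let

\[
Q^{*}_{0}(\theta) = \sum_{g\in\mathcal{G}_0} (\rho^{*}_{F}(\theta,g))^2\mu(g).
\tag{B.6}
\]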

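Lemma 3. Suppose that Assumptions 1-6 hold. Then for any $c > 0$,

\[
\inf_{\theta\in\Theta:\|\theta-\theta_0\|>c} Q^{*}_{0}(\theta) > 0.
\tag{B.7}
\]

Let $Q_{0,T}(\theta) = \sum_{g\in\mathcal{G}_0}\left([\rho^{u}_{T}(\theta,g)]_{-}^{2} + [\rho^{\ell}_{T}(\theta,g)]_{-}^{2}\right)\mu(g)$.

Lemma 4. Suppose that Assumptions 3-7 hold. Then, $\sup_{\theta\in\Theta}|Q_{0,T}(\theta)-Q^{*}_{0}(\theta)| \to_p 0$.

Lemma 5. Suppose that Assumption 8 holds. Then,

\[
Q_T(\theta_0) = o_p(1).
\tag{B.8}
\]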

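Proof of Theorem 1. Consider an arbitrary $c > 0$. Let $q = \inf_{\theta\in\Theta:\|\theta-\theta_0\|>c} Q^{*}_{0}(\theta)$. Then $q > 0$. The theorem is implied by the following derivation:

\[
\begin{aligned}
\Pr\left(\|\theta^{BD}_{T}-\theta_0\| > c\right)
&\le \Pr\left(Q^{*}_{0}(\theta^{BD}_{T}) \ge q\right) \\
&= \Pr\left(Q^{*}_{0}(\theta^{BD}_{T}) - Q_{0,T}(\theta^{BD}_{T}) + Q_{0,T}(\theta^{BD}_{T}) \ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^{*}_{0}(\theta)-Q_{0,T}(\theta)| + Q_{0,T}(\theta^{BD}_{T}) \ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^{*}_{0}(\theta)-Q_{0,T}(\theta)| + Q_{T}(\theta^{BD}_{T}) \ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^{*}_{0}(\theta)-Q_{0,T}(\theta)| + Q_{T}(\theta_0) \ge q\right) \\
&\le \Pr\left(\sup_{\theta\in\Theta}|Q^{*}_{0}(\theta)-Q_{0,T}(\theta)| \ge q/2\right) + \Pr\left(Q_{T}(\theta_0) \ge q/2\right) \\
&\to 0,
\end{aligned}
\tag{B.9}
\]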


where the first inequality holds by Lemma 3, the third inequality holds because $Q_T(\theta)$ differs from $Q_{0,T}(\theta)$ only in that the former takes the integral over a larger range, and because the common integrand of both is non-negative, the fourth inequality holds because $Q_T(\theta^{BD}_{T}) \le Q_T(\theta_0)$ by the definition of $\theta^{BD}_{T}$, and the convergence holds by Lemmas 4 and 5.

Proof of Lemma 3. Consider a sequence $\{\theta_m\}_{m=1}^{\infty}$ such that

\[
\lim_{m\to\infty} Q^{*}_{0}(\theta_m) = \inf_{\theta\in\Theta:\|\theta-\theta_0\|>c} Q^{*}_{0}(\theta).
\tag{B.10}
\]

Because Θ is compact, it is without loss of generality to assume that limm→∞ θm = θ∗ for some

θ∗ ∈ Θ.

Because ‖θm − θ0‖ > c for all m, we have

‖θ∗ − θ0‖ ≥ c. (B.11)

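That is, $\theta^{*} \neq \theta_0$. By Lemma 2, there exists a $g^{*}\in\mathcal{G}_0$ such that $\rho^{*}_{F}(\theta^{*},g^{*}) \neq 0$. Next, we show that

\[
\lim_{m\to\infty} \rho^{*}_{F}(\theta_m,g^{*}) = \rho^{*}_{F}(\theta^{*},g^{*}).
\tag{B.12}
\]

Once this is established, the result of the lemma is implied by

\[
\inf_{\theta\in\Theta:\|\theta-\theta_0\|>c} Q^{*}_{0}(\theta) = \lim_{m\to\infty} Q^{*}_{0}(\theta_m) \ge \lim_{m\to\infty} \mu(g^{*})\rho^{*}_{F}(\theta_m,g^{*})^2 = \mu(g^{*})\rho^{*}_{F}(\theta^{*},g^{*})^2 > 0.
\tag{B.13}
\]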

We show (B.12) using the dominated convergence theorem (DCT). By Assumption 6(a), we have

with probability one,

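\[
(\sigma^{-1}_{j}(\pi_t,x_t,\lambda_m)-\beta_m' x_{jt})\,g^{*}(z_{jt}) \to (\sigma^{-1}_{j}(\pi_t,x_t,\lambda^{*})-\beta^{*\prime} x_{jt})\,g^{*}(z_{jt}).
\tag{B.14}
\]

Also observe that $|(\sigma^{-1}_{j}(\pi_t,x_t,\lambda_m)-\beta_m' x_{jt})\,g^{*}(z_{jt})| \le \sup_{\theta\in\Theta}|\sigma^{-1}_{j}(\pi_t,x_t,\lambda)-\beta' x_{jt}|\,1\{z_{jt}\in\mathcal{Z}_0\}$. The right-hand-side is integrable by Assumption 5(a). Therefore, the DCT applies and yields (B.12).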

Proof of Lemma 4. First, we show that,

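\[
\sup_{g\in\mathcal{G}_0}\ \sup_{\lambda}\ \left|(TJ)^{-1}\sum_{t=1}^{T}\sum_{j=1}^{J_t}\left(\delta^{u}_{jt}(\lambda)-\sigma^{-1}_{j}(\pi_t,x_t;\lambda)\right)g(z_{jt})\right| \to_p 0.
\tag{B.15}
\]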


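To show this, observe that with probability one, the left-hand-side is less than or equal to

\[
\sup_{t,j:\,z_{jt}\in\mathcal{Z}_0}\ \sup_{\lambda}\ \left|\sigma^{-1}_{j}(\tilde{s}_t,x_t;\lambda)-\sigma^{-1}_{j}(\pi_t,x_t;\lambda)\right|
+ \sup_{t,j:\,z_{jt}\in\mathcal{Z}_0}\left|\ln\left(\frac{\tilde{s}_{jt}+\eta}{\tilde{s}_{0t}-\eta}\right)-\ln\left(\frac{\tilde{s}_{jt}}{\tilde{s}_{0t}}\right)\right|.
\tag{B.16}
\]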

The in-probability convergence of this to zero is implied by Assumption 3 and the fact that $\eta \le 1/(n_t+J_t+1) \to 0$, as long as we can show that $\max_{j=0,\dots,J_t;\,t=1,\dots,T}|\tilde{s}_{jt}-\pi_{jt}| \to_p 0$, where $\tilde{s}_{jt} = (n_t s_{jt}+1)/(n_t+J_t+1)$ is the Laplace share. The latter convergence is true by the following derivation:

\[
\begin{aligned}
\max_{j;t}|\tilde{s}_{jt}-\pi_{jt}|
&\le \max_{j;t}\left|s_{jt}-\pi_{jt}+(n_t+J_t+1)^{-1}-(J_t+1)s_{jt}/(n_t+J_t+1)\right| \\
&\le \max_{j;t}|s_{jt}-\pi_{jt}| + |n_t^{-1}| + |(J_t+1)/n_t| \\
&\to_p 0,
\end{aligned}
\tag{B.17}
\]

where the convergence holds by Assumption 7. Therefore, (B.15) is proved.
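The role of the Laplace share in (B.17) can be illustrated numerically. The sketch below assumes the transformation $(n_t s_{jt}+1)/(n_t+J_t+1)$ used in the derivation above; the function name `laplace_shares` and the representation of a market as a list of purchase counts (outside good first) are illustrative choices, not part of the paper.

```python
def laplace_shares(counts):
    """Map purchase counts to Laplace shares (count + 1) / (n + J + 1).

    counts: purchases of the outside good (index 0) followed by the J inside
    goods; n is the total number of sampled consumers (assumed layout).
    """
    n = sum(counts)
    J = len(counts) - 1
    return [(c + 1) / (n + J + 1) for c in counts]

counts = [7, 3, 0]                      # one product has zero sales
n, J = sum(counts), len(counts) - 1
raw = [c / n for c in counts]           # empirical shares, contains a zero
lap = laplace_shares(counts)            # strictly positive, sums to one
max_gap = max(abs(a - b) for a, b in zip(lap, raw))
bound = 1 / n + (J + 1) / n             # the two extra terms in (B.17)
```

The Laplace shares stay at least $1/(n_t+J_t+1)$ away from zero, so log shares are always defined, while the gap to the empirical shares is controlled by terms that vanish when $n_t$ grows faster than $J_t$.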

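Equation (B.15) and Assumption 6(b) together show that

\[
\sup_{g\in\mathcal{G}_0}\ \sup_{\theta\in\Theta}\ |\rho^{u}_{T}(\theta,g)-\rho^{*}_{F}(\theta,g)| \to_p 0.
\tag{B.18}
\]

Similarly,

\[
\sup_{g\in\mathcal{G}_0}\ \sup_{\theta\in\Theta}\ \left|\rho^{\ell}_{T}(\theta,g)+\rho^{*}_{F}(\theta,g)\right| \to_p 0.
\tag{B.19}
\]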

Then we can show that

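\[
\begin{aligned}
\sup_{\theta\in\Theta}|Q_{0,T}(\theta)-Q^{*}_{0}(\theta)|
&\le \sum_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|[\rho^{u}_{T}(\theta,g)]_{-}^{2}+[\rho^{\ell}_{T}(\theta,g)]_{-}^{2}-\rho^{*}_{F}(\theta,g)^{2}\right|\mu(g) \\
&\le \sum_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|[\rho^{u}_{T}(\theta,g)]_{-}^{2}-[\rho^{*}_{F}(\theta,g)]_{-}^{2}\right|\mu(g)
 + \sum_{g\in\mathcal{G}_0}\sup_{\theta\in\Theta}\left|[\rho^{\ell}_{T}(\theta,g)]_{-}^{2}-[-\rho^{*}_{F}(\theta,g)]_{-}^{2}\right|\mu(g) \\
&= o_p(1).
\end{aligned}
\tag{B.20}
\]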

This concludes the proof of the lemma.

Proof of Lemma 5. The lemma is immediately implied by Assumption 8 and equation (4.14).


C Assumptions and Proof of Asymptotic Normality

C.1 Additional Assumptions for Asymptotic Normality

We derive the asymptotic normality of our bound point estimator using techniques similar to those of Khan and Tamer (2009).

Additional assumptions are needed. We divide the assumptions into two groups. The first group, Assumptions C.1-C.3, is needed for deriving the convergence rate. On top of those, Assumptions C.4-C.7 are needed for the asymptotic normality.

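Assumption C.1. For an $\varepsilon > 0$ and an open ball, $B_{\varepsilon}(\theta_0)$, of radius $\varepsilon$ around $\theta_0$,

\[
\sup_{g\in\mathcal{G},\,\theta\in B_{\varepsilon}(\theta_0)} \left|\rho^{u}_{T}(\theta,g)-\rho^{u}_{F,T}(\theta,g)\right| + \left|\rho^{\ell}_{T}(\theta,g)-\rho^{\ell}_{F,T}(\theta,g)\right| = O_p((TJ)^{-1/2}).
\]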

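Assumption C.2. (a) $\sigma^{-1}_{j}(\pi_t,x_t;\lambda)$ is continuously differentiable in $\lambda$ in $B_{\varepsilon}(\lambda_0)$, an open ball around $\lambda_0$, and $E\|x_{jt}\| < \infty$.

(b) $E\sup_{\lambda\in B_{\varepsilon}(\lambda_0)}\|\partial\sigma^{-1}_{j}(\pi_t,x_t;\lambda)/\partial\lambda\| < \infty$ and $E\|x_{jt}\| < \infty$.

(c) $\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta'}\mu(g)$ is positive definite.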

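Assumption C.3. For an $\varepsilon > 0$, we have

\[
\sup_{g\in\mathcal{G}_0,\,\theta\in B_{\varepsilon}(\theta_0)} \left|\rho^{u}_{F,T}(\theta,g)-\rho^{*}_{F}(\theta,g)\right| + \left|\rho^{\ell}_{F,T}(\theta,g)+\rho^{*}_{F}(\theta,g)\right| = O((TJ)^{-1/2}).
\]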

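Assumption C.4. (a) $(TJ)^{1/2}\rho^{u}_{F,T}(\theta_0,g)\to\infty$ and $(TJ)^{1/2}\rho^{\ell}_{F,T}(\theta_0,g)\to\infty$ for all $g\in\mathcal{G}\setminus\mathcal{G}_0$.

(b) $\{\rho^{u}_{F,T}(\theta,g) : g\in\mathcal{G},\ T=1,2,3,\dots\}$ and $\{\rho^{\ell}_{F,T}(\theta,g) : g\in\mathcal{G},\ T=1,2,3,\dots\}$ are equi-continuous in $\theta$ at $\theta_0$,

(c) $\rho^{u}_{F,T}(\theta,g)$ and $\rho^{\ell}_{F,T}(\theta,g)$ are differentiable in $\theta$ for every $g\in\mathcal{G}$ and $T$.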

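Assumption C.5. (a) For any $\varepsilon > 0$, we have $\sup_{\theta\in B_{\varepsilon}(\theta_0),\,g\in\mathcal{G}}\left\|\frac{\partial\rho^{u}_{T}(\theta,g)}{\partial\theta}-\frac{\partial\rho^{u}_{F,T}(\theta,g)}{\partial\theta}\right\| \to_p 0$, and also $\sup_{\theta\in B_{\varepsilon}(\theta_0),\,g\in\mathcal{G}}\left\|\frac{\partial\rho^{\ell}_{T}(\theta,g)}{\partial\theta}-\frac{\partial\rho^{\ell}_{F,T}(\theta,g)}{\partial\theta}\right\| \to_p 0$,

(b) $\left\{\frac{\partial\rho^{u}_{F,T}(\theta,g)}{\partial\theta} : g\in\mathcal{G},\ T=1,2,3,\dots\right\}$ and $\left\{\frac{\partial\rho^{\ell}_{F,T}(\theta,g)}{\partial\theta} : g\in\mathcal{G},\ T=1,2,3,\dots\right\}$ are equi-continuous in $\theta$ at $\theta_0$.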

Define the infeasible sample moment function:

\[
\rho^{*}_{T}(\theta,g) = (TJ)^{-1}\sum_{t=1}^{T}\sum_{j=1}^{J_t}\left(\sigma^{-1}_{j}(\pi_t,x_t;\lambda)-\beta' x_{jt}\right)g(z_{jt}).
\tag{C.1}
\]

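Assumption C.6. (a) $\sup_{\theta\in B_{\varepsilon}(\theta_0),\,g\in\mathcal{G}_0}\|\rho^{u}_{T}(\theta,g)-\rho^{*}_{T}(\theta,g)\| = o_p((TJ)^{-1/2})$, and $\sup_{\theta\in B_{\varepsilon}(\theta_0),\,g\in\mathcal{G}_0}\|\rho^{\ell}_{T}(\theta,g)+\rho^{*}_{T}(\theta,g)\| = o_p((TJ)^{-1/2})$.

(b) $\sup_{\theta\in B_{\varepsilon}(\theta_0),\,g\in\mathcal{G}_0}\left\|\frac{\partial\rho^{u}_{T}(\theta,g)}{\partial\theta}-\frac{\partial\rho^{*}_{T}(\theta,g)}{\partial\theta}\right\| = o_p(1)$, and $\sup_{\theta\in B_{\varepsilon}(\theta_0),\,g\in\mathcal{G}_0}\left\|\frac{\partial\rho^{\ell}_{T}(\theta,g)}{\partial\theta}+\frac{\partial\rho^{*}_{T}(\theta,g)}{\partial\theta}\right\| = o_p(1)$.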


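Assumption C.7. (a) $(TJ)^{1/2}\rho^{*}_{T}(\theta_0,\cdot)\to_d \nu_{\Sigma}(\cdot)$, where $\{\nu_{\Sigma}(g) : g\in\mathcal{G}_0\}$ is a tight Gaussian process with variance-covariance kernel $\{\Sigma(g,g^{*}) : (g,g^{*})\in\mathcal{G}_0^{2}\}$.

(b) $\sup_{g\in\mathcal{G}_0,\,\theta\in B_{\varepsilon}(\theta_0)}\left\|\frac{\partial\rho^{*}_{T}(\theta,g)}{\partial\theta}-\frac{\partial\rho^{*}_{F}(\theta,g)}{\partial\theta}\right\| = o_p(1)$.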

C.2 Proof of Asymptotic Normality

We now prove Theorem 2. The proof uses the following lemma. The proof of the lemma is given

after that of Theorem 2.

Lemma C.1. Suppose that Assumptions 1-8 and C.1-C.3 are satisfied. Then

$\|\theta^{BD}_{T}-\theta_0\| = O_p((TJ)^{-1/2})$.

Proof of Theorem 2. The criterion function is differentiable. Thus, we have the first-order-condition:

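\[
0 = \frac{\partial Q_T(\theta^{BD}_{T})}{\partial\theta}
= 2\sum_{g\in\mathcal{G}}[\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g)
+ 2\sum_{g\in\mathcal{G}}[\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{\ell}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g).
\tag{C.2}
\]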

Consider the derivation: for all g ∈ G\G0,

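\[
\Pr(\rho^{u}_{T}(\theta^{BD}_{T},g) < 0) = \Pr\left((TJ)^{1/2}\left(\rho^{u}_{T}(\theta^{BD}_{T},g)-\rho^{u}_{F,T}(\theta_0,g)\right) < -(TJ)^{1/2}\rho^{u}_{F,T}(\theta_0,g)\right) \to 0,
\tag{C.3}
\]

where the convergence holds by Assumptions C.1 and C.4(a). Similarly, we have, for every $g\in\mathcal{G}\setminus\mathcal{G}_0$,

\[
\Pr(\rho^{\ell}_{T}(\theta^{BD}_{T},g) < 0) \to 0.
\tag{C.4}
\]

Thus, for every $g\in\mathcal{G}\setminus\mathcal{G}_0$,

\[
\Pr\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta} \neq 0\right)\to 0
\quad\text{and}\quad
\Pr\left([\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{\ell}_{T}(\theta^{BD}_{T},g)}{\partial\theta} \neq 0\right)\to 0.
\tag{C.5}
\]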

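Because $\mathcal{G}\setminus\mathcal{G}_0$ is a countable set, the above convergence implies that, for any subsequence of $\{T\}_{T=1}^{\infty}$, there exists a further subsequence $\{a_T\}_{T=1}^{\infty}$ such that

\[
(a_T J_{a_T})^{1/2}[\rho^{u}_{a_T}(\theta^{BD}_{a_T},g)]_{-}\frac{\partial\rho^{u}_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}\to 0
\quad\text{and}\quad
(a_T J_{a_T})^{1/2}[\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)]_{-}\frac{\partial\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}\to 0,
\tag{C.6}
\]

almost surely for every $g\in\mathcal{G}\setminus\mathcal{G}_0$, where $J_{a_T} = \sum_{t=1}^{a_T} J_t$. By the bounded convergence theorem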


(applied sample path by sample path), we have

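\[
\sum_{g\in\mathcal{G}\setminus\mathcal{G}_0}\mu(g)\left((a_T J_{a_T})^{1/2}[\rho^{u}_{a_T}(\theta^{BD}_{a_T},g)]_{-}\frac{\partial\rho^{u}_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}
+(a_T J_{a_T})^{1/2}[\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)]_{-}\frac{\partial\rho^{\ell}_{a_T}(\theta^{BD}_{a_T},g)}{\partial\theta}\right)\to 0
\tag{C.7}
\]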

almost surely. Thus,

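\[
\sum_{g\in\mathcal{G}\setminus\mathcal{G}_0}\mu(g)\left((TJ)^{1/2}[\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}
+(TJ)^{1/2}[\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{\ell}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\right)\to_p 0.
\tag{C.8}
\]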

This implies that

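\[
\frac{\partial Q_T(\theta^{BD}_{T})}{\partial\theta}
= o_p((TJ)^{-1/2}) + 2\sum_{g\in\mathcal{G}_0}\mu(g)\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}
+[\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{\ell}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\right).
\tag{C.9}
\]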

Next consider the following derivation:

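\[
\begin{aligned}
&\sum_{g\in\mathcal{G}_0}[\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g)
-\sum_{g\in\mathcal{G}_0}[\rho^{*}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g) \\
&\quad=\sum_{g\in\mathcal{G}_0}\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}-[\rho^{*}_{T}(\theta^{BD}_{T},g)]_{-}\right)\left(\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}-\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\right)\mu(g) \\
&\qquad+\sum_{g\in\mathcal{G}_0}\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}-[\rho^{*}_{T}(\theta^{BD}_{T},g)]_{-}\right)\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g) \\
&\qquad+\sum_{g\in\mathcal{G}_0}\left[\rho^{*}_{T}(\theta_0,g)+\frac{\partial\rho^{*}_{T}(\theta_T,g)}{\partial\theta'}(\theta^{BD}_{T}-\theta_0)\right]_{-}\left(\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}-\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\right)\mu(g).
\end{aligned}
\tag{C.10}
\]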

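The first summand on the right-hand-side is $o_p((TJ)^{-1/2})$ by Assumption C.6. The second summand is $o_p((TJ)^{-1/2})$ by Assumption C.6(a), and

\[
\sup_{g\in\mathcal{G}_0}\left\|\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}-\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\right\| = o_p(1),
\tag{C.11}
\]

which holds by Assumptions C.5(a)-(b), C.6(b), and C.7(b). The third summand is $o_p((TJ)^{-1/2})$ by Assumptions C.6(a) and C.7(a), Lemma C.1, and

\[
\sup_{g\in\mathcal{G}_0}\left\|\frac{\partial\rho^{*}_{T}(\theta_T,g)}{\partial\theta}-\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\right\| = o_p(1),
\tag{C.12}
\]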


which holds similarly to (C.11). Therefore,

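\[
\sum_{g\in\mathcal{G}_0}[\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{u}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g)
= o_p((TJ)^{-1/2}) + \sum_{g\in\mathcal{G}_0}[\rho^{*}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g).
\tag{C.13}
\]

Similarly, we can show that

\[
\sum_{g\in\mathcal{G}_0}[\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{\ell}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g)
= o_p((TJ)^{-1/2}) - \sum_{g\in\mathcal{G}_0}[-\rho^{*}_{T}(\theta^{BD}_{T},g)]_{-}\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g).
\tag{C.14}
\]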

Equations (C.2), (C.9), (C.13), and (C.14) together show that

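\[
0 = \frac{\partial Q_T(\theta^{BD}_{T})}{\partial\theta} = o_p((TJ)^{-1/2}) + 2\sum_{g\in\mathcal{G}_0}\rho^{*}_{T}(\theta^{BD}_{T},g)\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g).
\tag{C.15}
\]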

Apply a mean-value expansion of ρ∗T (θBDT , g) around θ0, and we get

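\[
o_p((TJ)^{-1/2}) = \sum_{g\in\mathcal{G}_0}\rho^{*}_{T}(\theta_0,g)\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g)
+ \left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{T}(\theta_T,g)}{\partial\theta}\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta'}\mu(g)\right](\theta^{BD}_{T}-\theta_0).
\tag{C.16}
\]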

Therefore,

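\[
\begin{aligned}
(TJ)^{1/2}(\theta^{BD}_{T}-\theta_0)
&= o_p(1) + \left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{T}(\theta_T,g)}{\partial\theta}\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta'}\mu(g)\right]^{-1}\sum_{g\in\mathcal{G}_0}\rho^{*}_{T}(\theta_0,g)\frac{\partial\rho^{*}_{T}(\theta^{BD}_{T},g)}{\partial\theta}\mu(g) \\
&\to_d \left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta'}\mu(g)\right]^{-1}\sum_{g\in\mathcal{G}_0}\nu_{\Sigma}(g)\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\mu(g) \\
&=_d N(0,\Gamma V\Gamma),
\end{aligned}
\tag{C.17}
\]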

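where $\Gamma = \left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta'}\mu(g)\right]^{-1}$, and

\[
V = \sum_{g,g^{*}\in\mathcal{G}_0}\Sigma(g,g^{*})\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_0,g^{*})}{\partial\theta'}\mu(g)\mu(g^{*}).
\tag{C.18}
\]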

This concludes the proof of the theorem.


Proof of Lemma C.1. Below, we show the following results:

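\[
Q_{0,T}(\theta^{BD}_{T}) = O_p((TJ)^{-1}),
\tag{C.19}
\]

\[
Q_{0}(\theta^{BD}_{T}) \ge c\|\theta^{BD}_{T}-\theta_0\|^{2},
\tag{C.20}
\]

\[
Q_{0,T}(\theta^{BD}_{T})-Q_{0}(\theta^{BD}_{T}) = O_p((TJ)^{-1}) + O_p((TJ)^{-1/2})\|\theta^{BD}_{T}-\theta_0\|,
\tag{C.21}
\]

for some $c > 0$.

Equations (C.19)-(C.21) together imply that

\[
c\|\theta^{BD}_{T}-\theta_0\|^{2} + O_p((TJ)^{-1/2})\|\theta^{BD}_{T}-\theta_0\| = O_p((TJ)^{-1}).
\tag{C.22}
\]

This implies that

\[
\left(c^{1/2}\|\theta^{BD}_{T}-\theta_0\| + O_p((TJ)^{-1/2})\right)^{2} = O_p((TJ)^{-1}),
\tag{C.23}
\]

which then implies the conclusion of the lemma.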

Now we show (C.19). Observe that Q0,T (θBDT ) ≤ QT (θBDT ) ≤ QT (θ0). Thus, it suffices to show

that

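\[
Q_T(\theta_0) = O_p((TJ)^{-1}).
\tag{C.24}
\]

Consider the derivation:

\[
\begin{aligned}
Q_T(\theta_0)
&= \sum_{g\in\mathcal{G}}[\rho^{u}_{T}(\theta_0,g)-\rho^{u}_{F,T}(\theta_0,g)+\rho^{u}_{F,T}(\theta_0,g)]_{-}^{2}\mu(g)
 + \sum_{g\in\mathcal{G}}[\rho^{\ell}_{T}(\theta_0,g)-\rho^{\ell}_{F,T}(\theta_0,g)+\rho^{\ell}_{F,T}(\theta_0,g)]_{-}^{2}\mu(g) \\
&\le \sum_{g\in\mathcal{G}}[\rho^{u}_{T}(\theta_0,g)-\rho^{u}_{F,T}(\theta_0,g)]_{-}^{2}\mu(g)
 + \sum_{g\in\mathcal{G}}[\rho^{\ell}_{T}(\theta_0,g)-\rho^{\ell}_{F,T}(\theta_0,g)]_{-}^{2}\mu(g) \\
&= O_p((TJ)^{-1}),
\end{aligned}
\tag{C.25}
\]

where the inequality holds because $\rho^{\ell}_{F,T}(\theta_0,g) \ge 0$ and $\rho^{u}_{F,T}(\theta_0,g) \ge 0$ for all $g\in\mathcal{G}$ by definition, and the second equality holds by Assumption C.1. This shows (C.19).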


Next, we show (C.20). Consider the derivation:

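\[
\begin{aligned}
Q_{0}(\theta^{BD}_{T})
&= \sum_{g\in\mathcal{G}_0}(\theta^{BD}_{T}-\theta_0)'\frac{\partial\rho^{*}_{F}(\theta_T,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_T,g)}{\partial\theta'}(\theta^{BD}_{T}-\theta_0)\mu(g) \\
&= (\theta^{BD}_{T}-\theta_0)'\left[\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta'}\mu(g)+o_p(1)\right](\theta^{BD}_{T}-\theta_0) \\
&\ge c\|\theta^{BD}_{T}-\theta_0\|^{2},
\end{aligned}
\tag{C.26}
\]

where the first equality holds by a mean-value expansion with $\theta_T$ being a point on the line segment joining $\theta^{BD}_{T}$ and $\theta_0$, the second equality holds by Assumption C.2(a), and the inequality holds by Assumption C.2(b) where $c$ is the smallest eigenvalue of $\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_0,g)}{\partial\theta'}\mu(g)/2$. This shows (C.20).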

Finally, we show (C.21). Observe that:

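\[
Q_{0,T}(\theta^{BD}_{T})-Q_{0}(\theta^{BD}_{T})
= \sum_{g\in\mathcal{G}_0}\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}^{2}-[\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}^{2}\right)\mu(g)
+ \sum_{g\in\mathcal{G}_0}\left([\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}^{2}-[-\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}^{2}\right)\mu(g).
\tag{C.27}
\]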

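Consider the derivation regarding the first summand in the right-hand-side of the equation above:

\[
\begin{aligned}
&\sum_{g\in\mathcal{G}_0}\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}^{2}-[\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}^{2}\right)\mu(g) \\
&\quad= \sum_{g\in\mathcal{G}_0}\left\{\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}-[\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}\right)^{2}+2[\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}\left([\rho^{u}_{T}(\theta^{BD}_{T},g)]_{-}-[\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}\right)\right\}\mu(g) \\
&\quad\le \sum_{g\in\mathcal{G}_0}\left(\rho^{u}_{T}(\theta^{BD}_{T},g)-\rho^{*}_{F}(\theta^{BD}_{T},g)\right)^{2}\mu(g)
 + 2\left[\sum_{g\in\mathcal{G}_0}\rho^{*}_{F}(\theta^{BD}_{T},g)^{2}\mu(g)\right]^{1/2}\left[\sum_{g\in\mathcal{G}_0}\left(\rho^{u}_{T}(\theta^{BD}_{T},g)-\rho^{*}_{F}(\theta^{BD}_{T},g)\right)^{2}\mu(g)\right]^{1/2} \\
&\quad= O_p((TJ)^{-1}) + 2\,O_p((TJ)^{-1/2})\left[\sum_{g\in\mathcal{G}_0}\rho^{*}_{F}(\theta^{BD}_{T},g)^{2}\mu(g)\right]^{1/2} \\
&\quad= O_p((TJ)^{-1}) + O_p((TJ)^{-1/2})\left[(\theta^{BD}_{T}-\theta_0)'\sum_{g\in\mathcal{G}_0}\frac{\partial\rho^{*}_{F}(\theta_T,g)}{\partial\theta}\frac{\partial\rho^{*}_{F}(\theta_T,g)}{\partial\theta'}\mu(g)\,(\theta^{BD}_{T}-\theta_0)\right]^{1/2} \\
&\quad= O_p((TJ)^{-1}) + O_p((TJ)^{-1/2})\|\theta^{BD}_{T}-\theta_0\|,
\end{aligned}
\tag{C.28}
\]

where the first equality holds by rearranging terms, the inequality holds by the fact that $|[a]_{-}-[b]_{-}| \le |a-b|$ for any $a, b\in R$, and by the Cauchy-Schwarz Inequality, the second equality holds by Assumptions C.1 and C.3 and Theorem 1, the third equality holds with $\theta_T$ being a point on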


the line segment joining $\theta^{BD}_{T}$ and $\theta_0$ by a mean-value expansion, and the last equality holds by Assumption C.2 and Theorem 1. Similarly, we can show

\[
\sum_{g\in\mathcal{G}_0}\left([\rho^{\ell}_{T}(\theta^{BD}_{T},g)]_{-}^{2}-[-\rho^{*}_{F}(\theta^{BD}_{T},g)]_{-}^{2}\right)\mu(g) = O_p((TJ)^{-1}) + O_p((TJ)^{-1/2})\|\theta^{BD}_{T}-\theta_0\|.
\tag{C.29}
\]

Therefore, (C.21) is shown, and this concludes the proof of the lemma.

References

Anderson, C. (2006): The Long Tail: Why the Future of Business Is Selling Less of More.

Hyperion.

Andrews, D. W. K., and X. Shi (2013): “Inference Based on Conditional Moment Inequality

Models,” Econometrica, 81.

Berry, S. (1994): “Estimating discrete-choice models of product differentiation,” The RAND

Journal of Economics, pp. 242–262.

Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile prices in market equilibrium,”

Econometrica: Journal of the Econometric Society, pp. 841–890.

Berry, S., J. Levinsohn, and A. Pakes (2004): “Differentiated Products Demand Systems

from a Combination of Micro and Macro Data: The New Vehicle Market,” Journal of Political

Economy, 112, 68–104.

Berry, S., O. Linton, and A. Pakes (2004): “Limit theorems for estimating the parameters of

differentiated product demand systems,” Review of Economic Studies, 71(3), 613–654.

Berry, S. T., and P. A. Haile (2014): “Identification in differentiated products markets using

market level data,” Econometrica, 82(5), 1749–1797.

Chevalier, J. A., A. K. Kashyap, and P. E. Rossi (2003): “Why Don’t Prices Rise During

Periods of Peak Demand? Evidence from Scanner Data,” American Economic Review, 93(1),

15–37.

Dezhbakhsh, H., P. H. Rubin, and J. M. Shepherd (2003): “Does capital punishment have a

deterrent effect? New evidence from postmoratorium panel data,” American Law and Economics

Review, 5(2), 344–376.

Dube, J.-P., J. T. Fox, and C.-L. Su (2012): “Improving the numerical performance of static

and dynamic aggregate discrete choice random coefficients demand estimation,” Econometrica,

80(5), 2231–2267.

Freyberger, J. (2015): “Asymptotic theory for differentiated products demand models with

many markets,” Journal of Econometrics, 185(1), 162–181.


Gabaix, X. (1999a): “Zipf’s Law and the Growth of Cities,” The American Economic Review,

Papers and Proceedings, 89, 129–132.

Gandhi, A., Z. Lu, and X. Shi (2013): “Estimating Demand for Differentiated Products with

Error in Market Shares,” CeMMAP working paper.

Goolsbee, A., and A. Petrin (2004): “The consumer gains from direct broadcast satellites and

the competition with cable TV,” Econometrica, 72(2), 351–381.

Head, K., T. Mayer, et al. (2013): “Gravity equations: Workhorse, toolkit, and cookbook,”

Handbook of international economics, 4.

Hosken, D., and D. Reiffen (2004): “Patterns of retail price variation,” RAND Journal of

Economics, pp. 128–146.

Jaynes, E. T. (2003): Probability Theory: The Logic of Science. Cambridge University Press, 1st

edn.

Khan, S., and E. Tamer (2009): “Inference on Randomly Censored Regression Models Using

Conditional Moment Inequalities,” Journal of Econometrics, 152, 104–119.

Nevo, A., and K. Hatzitaskos (2006): “Why does the average price paid fall during high demand

periods?,” Discussion paper, CSIO working paper.

Nurski, L., and F. Verboven (2016): “Exclusive Dealing as a Barrier to Entry? Evidence from

Automobiles,” The Review of Economic Studies, 83(3), 1156.

Quan, T. W., and K. R. Williams (2015): “Product Variety, Across-market Demand Hetero-

geneity, And The Value Of Online Retail,” Working Paper.


Online Appendix to “Estimating Demand for Differentiated Products with Zeroes in Market Share Data”

In this online appendix, we introduce the profiling approach for models defined by many moment

inequalities. The profiling approach developed here is similar to the penalized resampling approach

in Bugni, Canay, and Shi (2016) for unconditional moment inequality models. Section D describes

the profiling approach and gives the formal results, and Section E presents the proofs of those

results.

D The Profiling Approach

The profiling approach applies to general moment inequality models with many moment inequali-

ties. Thus from this point on, we focus on the moment inequality model:

Eρ(wt, θ, g) ≥ 0 for all g ∈ G, (D.1)

where $\rho$ takes values in $R^{k}$. We also let $\mathcal{G}$ be a general set of indices that can be either countable or uncountable. Let $\mu : \mathcal{G}\to[0,1]$ denote a probability density on $\mathcal{G}$. We assume the data $\{w_t\}_{t=1}^{T}$ are i.i.d. across $t$.

We assume that there is a parameter of interest, γ0, that is related to θ0 through:

\[
\gamma_0 \in \Gamma(\theta_0) \subseteq R^{d_\gamma},
\tag{D.2}
\]

where $\Gamma : \Theta\to 2^{R^{d_\gamma}}$ is a known mapping where $2^{R^{d_\gamma}}$ denotes the collection of all subsets of $R^{d_\gamma}$.

Three examples of Γ are given below:

Example. $\Gamma(\theta) = \{\alpha\}$: $\gamma_0$ is the price coefficient $\alpha_0$. In the simple logit model, the price coefficient is all one needs to know to compute the demand elasticity.

Example. $\Gamma(\theta) = \{e_j(p,\pi,\theta,x)\} = \{(\alpha p_j/\pi_j)(\partial\sigma_j(\sigma^{-1}(\pi,x,\lambda),x,\lambda)/\partial\delta_j)\}$: $\gamma_0$ is the own-price demand elasticity of product $j$ at a given value of the price vector $p$, the choice probability vector $\pi$ and the covariates $x$.

Example. $\Gamma(\theta) = \{e_j(p,\pi,\theta,x) : \pi\in[\pi_l,\pi_u]\}$: $\gamma_0$ is the demand elasticity of product $j$ at a given value of the price vector $p$, the covariates $x$ and at the choice probability vector that is known to lie between $\pi_l$ and $\pi_u$. This example is particularly useful when the elasticity depends on the choice probability but the choice probability is only known to lie in an interval.
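To make the elasticity examples concrete, the sketch below evaluates $e_j = (\alpha p_j/\pi_j)\,\partial\sigma_j/\partial\delta_j$ in the simple logit special case, where $\partial\sigma_j/\partial\delta_j = \pi_j(1-\pi_j)$ so that $e_j = \alpha p_j(1-\pi_j)$. The parameter values and the names `logit_shares` and `own_price_elasticity` are hypothetical and chosen only for illustration; random-coefficient models would need a numerical derivative of $\sigma_j$ instead.

```python
import math

def logit_shares(delta):
    """Simple logit choice probabilities; the outside good has mean utility 0."""
    denom = 1.0 + sum(math.exp(d) for d in delta)
    return [math.exp(d) / denom for d in delta]

def own_price_elasticity(alpha, prices, shares, j):
    """e_j = (alpha * p_j / pi_j) * d sigma_j / d delta_j for the simple logit,
    where d sigma_j / d delta_j = pi_j * (1 - pi_j)."""
    dsigma_ddelta = shares[j] * (1.0 - shares[j])
    return alpha * prices[j] / shares[j] * dsigma_ddelta

alpha = -2.0                      # hypothetical price coefficient
prices = [1.0, 1.5]
base_utils = [1.0, 2.0]           # hypothetical non-price part of mean utility
delta = [u + alpha * p for u, p in zip(base_utils, prices)]
shares = logit_shares(delta)
e0 = own_price_elasticity(alpha, prices, shares, 0)

# Finite-difference check of the same elasticity (d log share / d log price).
h = 1e-6
delta_h = [base_utils[0] + alpha * (prices[0] + h), delta[1]]
e0_fd = (logit_shares(delta_h)[0] - shares[0]) / h * prices[0] / shares[0]
```

In this special case the elasticity depends on $\theta$ only through $\alpha$, which is why the first example suffices for the simple logit.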

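Let $\Gamma_0$ be the identified set of $\gamma_0$: $\Gamma_0 = \{\gamma\in R^{d_\gamma} : \exists\,\theta\in\Theta_0 \text{ s.t. } \Gamma(\theta)\ni\gamma\}$, where $\Theta_0 = \{\theta\in\Theta : E\rho(w_t,\theta,g)\ge 0\ \forall g\in\mathcal{G}\}$. The profiling approach constructs a confidence set for $\gamma_0$ by inverting a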


test of the hypothesis:

H0 : γ ∈ Γ0, (D.3)

for each parameter value γ. The confidence set is the collection of values that are not rejected by

the test.

Let $\Gamma^{-1}(\gamma) = \{\theta\in\Theta : \Gamma(\theta)\ni\gamma\}$. The test to be inverted uses the profiled test statistic:

\[
T_T(\gamma) = T\times\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta),
\tag{D.4}
\]

where $Q_T(\theta)$ is an empirical measure of the violation of the moment inequalities. The confidence set of confidence level $p$ is the set of all points for which the test statistic does not exceed a critical value $c_T(\gamma,p)$:

\[
CS_T = \{\gamma\in R^{d_\gamma} : T_T(\gamma)\le c_T(\gamma,p)\}.
\tag{D.5}
\]

Notice that the new confidence set only involves computing a $d_\gamma$-dimensional level set, where $d_\gamma$ is often 1. The profiling transfers the burden of searching (for low values) over the surface of the nonsmooth function $T(\theta)-c(\theta)$ to searching over the surface of the typically smooth and often convex function $Q_T(\theta)$.
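The test inversion in (D.4)-(D.5) can be sketched on a toy problem. Everything specific below is an assumption made for illustration: a scalar parameter with $\Gamma(\theta) = \{\theta\}$ (so the inner minimization is trivial), the two moment inequalities $E[w_t]-\theta\ge 0$ and $\theta-E[w_t]+1\ge 0$, a grid of candidate $\gamma$ values, and a constant placeholder critical value standing in for $c_T(\gamma,p)$.

```python
def neg_part_sq(x):
    """Squared negative part, [x]_-^2."""
    return min(x, 0.0) ** 2

def profiled_stat(gamma, w):
    """T_T(gamma) = T * Q_T(gamma) for the toy model with Gamma(theta) = {theta}."""
    T = len(w)
    w_bar = sum(w) / T
    q = neg_part_sq(w_bar - gamma) + neg_part_sq(gamma - w_bar + 1.0)
    return T * q

def confidence_set(gamma_grid, stat_fn, crit_fn):
    """Keep every grid point whose statistic does not exceed its critical value."""
    return [g for g in gamma_grid if stat_fn(g) <= crit_fn(g)]

w = [0.9, 1.1, 1.0, 1.0]                # hypothetical data
grid = [-0.5, 0.0, 0.5, 1.0, 1.5]       # candidate values of gamma
cs = confidence_set(grid, lambda g: profiled_stat(g, w), lambda g: 0.5)
```

Only the low-dimensional grid over $\gamma$ is searched; in a real application `profiled_stat` would minimize $Q_T(\theta)$ over $\Gamma^{-1}(\gamma)$ and `crit_fn` would come from the subsampling or bootstrap procedure of Section D.2.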

We choose a critical value, $c_T(\gamma,p)$, of significance level $1-p\in(0,0.5)$, to satisfy

\[
\limsup_{T\to\infty}\ \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(T_T(\gamma) > c_T(\gamma,p)) \le 1-p,
\tag{D.6}
\]

where $F$ is the distribution of $(w_t)_{t=1}^{T}$ and $\mathcal{H}_0$ is the null parameter space of $(\gamma,F)$. The definition of $\mathcal{H}_0$ along with other technical assumptions is given in Section D.4.¹⁹

As a result of (D.6), the confidence set asymptotically has the correct minimum coverage prob-

ability:

\[
\liminf_{T\to\infty}\ \inf_{(\gamma,F)\in\mathcal{H}_0} \Pr\nolimits_F(\gamma\in CS_T) \ge p.
\tag{D.7}
\]

The left hand side is called the “asymptotic size” of the confidence set in Andrews and Shi (2013).

We achieve the asymptotic size control by deriving an asymptotic approximation for the distribution

of the profiled test statistic TT (γ) that is uniformly valid over (γ, F ) ∈ H0 and simulating the

critical value from the approximating distribution through either a subsampling or a bootstrapping

procedure.

In the next subsections, we describe the test statistic and the critical value in detail and show

that (D.7) holds.

19 Note that we use $F$ to denote the distribution of the full observed data vector and thus $(\gamma,F)$ captures everything unknown in the expression $\Pr_F(T_T(\gamma) > c_T(\gamma,p))$. This notation differs from the traditional literature where the true distribution of the data is often indicated by the true value of $\theta$, but is standard in the recent partial identification literature. See Romano and Shaikh (2008) and Andrews and Shi (2013).


D.1 Test Statistic

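The test statistic is the QLR statistic (i.e. a criterion-function-based statistic)²⁰

\[
T_T(\gamma) = T\times\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta)
\quad\text{with}\quad
Q_T(\theta) = \int_{\mathcal{G}_T} S(\rho_T(\theta,g),\Sigma^{\iota}_{T}(\theta,g))\,d\mu(g),
\tag{D.8}
\]

where $\mathcal{G}_T$ is a truncated/simulated version of $\mathcal{G}$ such that $\mathcal{G}_T\uparrow\mathcal{G}$ as $T\to\infty$, $\mu(\cdot)$ is a probability measure on $\mathcal{G}$, $S(m,\Sigma)$ is a real-valued function that measures the discrepancy of $m$ from the inequality restriction $m\ge 0$, and

\[
\begin{aligned}
\rho_T(\theta,g) &= T^{-1}\sum_{t=1}^{T}\rho(w_t,\theta,g), \\
\Sigma^{\iota}_{T}(\theta,g) &= \Sigma_T(\theta,g)+\iota\times\Sigma_T(\theta,1), \\
\Sigma_T(\theta,g) &= T^{-1}\sum_{t=1}^{T}\rho(w_t,\theta,g)\rho(w_t,\theta,g)' - \rho_T(\theta,g)\rho_T(\theta,g)'.
\end{aligned}
\tag{D.9}
\]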

In the above definition, $\iota$ is a small positive number. It is used because in some forms of $S$ defined in Section D.4 the inverses of the diagonal elements of $\Sigma^{\iota}_{T}(\theta,g)$ enter, and $\iota$ prevents us from taking the inverse of zero. In some other forms of $S$, e.g. the one defined below and used in the simulation and empirical sections of this paper, $\iota$ does not enter the test statistic because $S(m,\Sigma)$ does not depend on $\Sigma$.
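The sample quantities entering the criterion function can be computed directly from their definitions in (D.9): the sample mean of the moment function, its sample covariance, and the $\iota$-regularized covariance. The sketch below is a plain-Python illustration; the function name `sample_moments`, the row-per-market input layout, and the default `iota=0.05` are assumptions, not values prescribed by the paper.

```python
def sample_moments(rho_g, rho_1, iota=0.05):
    """Return (rho_bar, Sigma, Sigma_iota) for one (theta, g) pair.

    rho_g: list of T rows, each a list of the k moment values rho(w_t, theta, g).
    rho_1: the same moments evaluated at the constant instrument g = 1.
    iota:  small positive regularization constant (placeholder value).
    """
    def mean_and_cov(rows):
        T, k = len(rows), len(rows[0])
        mean = [sum(r[i] for r in rows) / T for i in range(k)]
        cov = [[sum(r[i] * r[j] for r in rows) / T - mean[i] * mean[j]
                for j in range(k)] for i in range(k)]
        return mean, cov

    rho_bar, sigma = mean_and_cov(rho_g)
    _, sigma_1 = mean_and_cov(rho_1)
    k = len(rho_bar)
    sigma_iota = [[sigma[i][j] + iota * sigma_1[i][j] for j in range(k)]
                  for i in range(k)]
    return rho_bar, sigma, sigma_iota

rho_bar, sigma, sigma_iota = sample_moments([[1.0], [3.0]], [[2.0], [4.0]])
```

Adding $\iota$ times the covariance at $g = 1$ keeps the diagonal of the regularized matrix away from zero even for instruments $g$ that select very few observations.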

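Section D.4 gives the assumptions that the user-chosen quantities $S$, $\mu$, $\mathcal{G}$ and $\mathcal{G}_T$ should satisfy. Under those assumptions, we can show that $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta)$ consistently estimates $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta)$ where

\[
\begin{aligned}
Q_F(\theta) &= \int_{\mathcal{G}} S(\rho_F(\theta,g),\Sigma^{\iota}_{F}(\theta,g))\,d\mu(g),\ \text{with} \\
\rho_F(\theta,g) &= E_F(\rho(w_t,\theta,g)), \\
\Sigma_F(\theta,g) &= \mathrm{Cov}_F(\rho(w_t,\theta,g))\ \text{and}\ \Sigma^{\iota}_{F}(\theta,g) = \Sigma_F(\theta,g)+\iota\Sigma_F(\theta,1).
\end{aligned}
\tag{D.10}
\]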

The symbols “EF ” and “CovF ” denote expectation and covariance under the data distribution F

respectively. Notice that Γ0 depends on F . We make this explicit by changing the notation Γ0 to

Γ0,F for the rest of this paper.

We can also show that minθ∈Γ−1(γ)QF (θ) = 0 if and only if γ ∈ Γ0,F . This result combined

with the consistency of minθ∈Γ−1(γ) QT (θ) implies that TT (γ) diverges to infinity at γ /∈ Γ0,F . That

20 Note that we do not follow the traditional QLR test exactly, which would define $T_T(\gamma) = T\times\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta) - T\times\min_{\theta\in\Theta} Q_T(\theta)$. This is because the validity of our critical value depends on certain monotonicity of the asymptotic approximation of the test statistic, and the monotonicity does not hold with this alternative test statistic due to the subtraction of $T\times\min_{\theta\in\Theta} Q_T(\theta)$.


implies that there is no information loss in using such a test statistic. Lemma D.1 summarizes those

two results. The parameter space H of (γ, F ) appearing in the lemma is defined in Assumption

D.2 in Section D.4.

Lemma D.1. Suppose that Assumptions D.1, D.2, D.4, D.5(a), and D.6(a) and (d) hold. Then for

any (γ, F ) ∈ H,

(a) minθ∈Γ−1(γ) QT (θ)→p minθ∈Γ−1(γ)QF (θ) under F , and

(b) minθ∈Γ−1(γ)QF (θ) ≥ 0 and = 0 if and only if γ ∈ Γ0,F .

In the simulation and the empirical application of this paper, the following choices of S, G, GT and µ

are used mainly for computational convenience. For G, we use the one defined in (4.19). For GT ,

the truncated version of G, we define it to be the same as G except that we let r run from r0 to rT

where rT →∞ as T →∞ in the definition.

For $S$, we use

\[
S(m,\Sigma) = \sum_{j=1}^{k}[m_j]_{-}^{2},
\tag{D.11}
\]

where $m_j$ is the $j$th coordinate of $m$ and $[x]_{-} = |\min\{x,0\}|$. There may be efficiency loss from not weighting the moments using the variance matrix, but this $S$ function brings great computational convenience because it makes the minimization problem in (D.4) a convex one. For $\mu(\cdot)$, we use

\[
\mu(g_{a,r,\zeta}) \propto (100+r)^{-2}(2r)^{-d_{z_c}}K_d^{-1}\ \text{for}\ g\in\mathcal{G}_{d,cc},
\tag{D.12}
\]

where $K_d$ is the number of elements in $\mathcal{Z}_d$. The same $\mu$ measure is used and seems to work well in Andrews and Shi (2013).
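The choices in (D.11) and (D.12) are simple enough to write down directly. The sketch below is illustrative: the function names are hypothetical, the weight is returned unnormalized (the proportionality constant in (D.12) is left out), and the second argument of `S` is kept only to mirror the signature $S(m,\Sigma)$.

```python
def neg_part(x):
    """[x]_- = |min{x, 0}|."""
    return abs(min(x, 0.0))

def S(m, Sigma=None):
    """Sum of squared negative parts of the moment vector m, as in (D.11).
    Sigma is accepted but unused, since this form of S does not depend on it."""
    return sum(neg_part(m_j) ** 2 for m_j in m)

def mu_weight(r, d_zc, K_d):
    """Unnormalized weight (100 + r)^(-2) * (2r)^(-d_zc) / K_d from (D.12)."""
    return (100 + r) ** -2 * (2 * r) ** -d_zc / K_d

violation = S([1.0, -2.0, -0.5])       # only the negative coordinates contribute
weight = mu_weight(r=1, d_zc=1, K_d=2)
```

Because only negative coordinates contribute, moment inequalities that hold in the sample add nothing to the criterion, and the weights decay in $r$ so that finer partitions of the instrument space receive less mass.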

D.2 Critical Value

We propose two types of critical values, one based on standard subsampling and the other based on a bootstrapping procedure with moment shrinking. Both are simple to compute. The bootstrap critical value may have better small sample properties, and is the procedure we use in the empirical section.²¹ It is worth noting that we resample at the market level for both the subsampling and the bootstrap.

Let us formally define the subsampling critical value first. It is obtained through the standard subsampling steps: [1] from $\{1,\dots,T\}$, draw without replacement a subsample of market indices of size $b_T$; [2] compute $T_{T,b_T}(\gamma)$ in the same way as $T_T(\gamma)$ except using the subsample of markets corresponding to the indices drawn in [1] rather than the original sample; [3] repeat [1]-[2] $S_T$ times and obtain $S_T$ independent (conditional on the original sample) copies of $T_{T,b_T}(\gamma)$; [4] let $c^{*}_{sub}(\gamma,p)$ be the $p$ quantile of the $S_T$ independent copies. Let the subsampling critical value be

\[
c^{sub}_{T}(\gamma,p) = c^{*}_{sub}(\gamma,p+\eta^{*})+\eta^{*},
\tag{D.13}
\]

21 The bootstrap procedure here, as in most problems with partial identification, does not lead to higher-order improvement.


where $\eta^{*} > 0$ is an infinitesimal number. The infinitesimal number is used to avoid making hard-to-verify uniform continuity and strict monotonicity assumptions on the distribution of the test statistic. It can be set to zero if one is willing to make the continuity assumptions. Such infinitesimal numbers are also employed in Andrews and Shi (2013). One can follow their suggestion of using $\eta^{*} = 10^{-6}$.
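Steps [1]-[4] and (D.13) can be sketched as follows. This is a minimal illustration under stated assumptions: `stat_fn` is a user-supplied (hypothetical) function that computes $T_{T,b_T}(\gamma)$ from a list of subsampled markets, including the scaling by the subsample size, and the quantile is taken as the smallest order statistic whose empirical distribution function is at least the requested level.

```python
import math
import random

def empirical_quantile(values, level):
    """Smallest order statistic whose empirical CDF is at least `level`."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, math.ceil(level * len(ordered)) - 1))
    return ordered[idx]

def subsampling_critical_value(markets, stat_fn, b, p, n_draws=200, eta=1e-6, seed=0):
    """c_sub(gamma, p) = quantile_{p + eta} of subsample statistics + eta.

    markets: list of market-level observations (resampling is at the market level).
    stat_fn: maps a subsample of markets to the statistic T_{T,b}(gamma).
    b:       subsample size b_T; n_draws plays the role of S_T.
    """
    rng = random.Random(seed)
    draws = [stat_fn(rng.sample(markets, b)) for _ in range(n_draws)]  # without replacement
    return empirical_quantile(draws, min(p + eta, 1.0)) + eta

# Toy usage: one moment inequality E[w] >= gamma, evaluated at gamma = 0.
markets = [0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, 0.5]
toy_stat = lambda sub: len(sub) * min(sum(sub) / len(sub), 0.0) ** 2
cv = subsampling_critical_value(markets, toy_stat, b=4, p=0.95)
```

The sketch draws whole markets, consistent with the market-level resampling described above; the choice of `b` relative to the sample size is governed by Assumption D.7 and is not addressed here.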

Let us now define the bootstrap critical value. It is obtained through the following steps: [1] from the original sample $\{1,\dots,T\}$, draw with replacement a bootstrap sample of size $T$; denote the bootstrap sample by $\{t_1,\dots,t_T\}$; [2] let the bootstrap statistic be

\[
T^{*}_{T}(\gamma) = \min_{\theta\in\Theta:\,\gamma\in\Gamma(\theta)} \int_{\mathcal{G}} S\left(\nu^{*}_{T}(\theta,g)+\kappa_T^{1/2}\rho_T(\theta,g),\ \Sigma^{\iota}_{T}(\theta,g)\right)d\mu(g),
\tag{D.14}
\]

where $\nu^{*}_{T}(\theta,g) = \sqrt{T}(\rho^{*}_{T}(\theta,g)-\rho_T(\theta,g))$, $\rho^{*}_{T}(\theta,g) = T^{-1}\sum_{\tau=1}^{T}\rho(w_{t_\tau},\theta,g)$, and $\kappa_T$ is a sequence of moment shrinking parameters: $\kappa_T/T+\kappa_T^{-1}\to 0$; [3] repeat [1]-[2] $S_T$ times and obtain $S_T$ independent (conditional on the original sample) copies of $T^{*}_{T}(\gamma)$; [4] let $c^{*}_{bt}(\gamma,p)$ be the $p$ quantile of the $S_T$ copies. Let the bootstrap critical value be

\[
c^{bt}_{T}(\gamma,p) = c^{*}_{bt}(\gamma,p+\eta^{*})+\eta^{*},
\tag{D.15}
\]

where $\eta^{*} > 0$ is an infinitesimal number which has the same function as in the subsampling critical value above.
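The bootstrap steps can be sketched for a simplified setting. The assumptions here are substantial and are made only to keep the example short: scalar moments ($k=1$), the $S$ function of (D.11), a finite list of instruments with weights in place of the integral over $\mathcal{G}$, and a finite grid `thetas` standing in for the set $\{\theta : \gamma\in\Gamma(\theta)\}$. The function and argument names are hypothetical.

```python
import math
import random

def bootstrap_critical_value(data, rho, thetas, instruments, weights, kappa, p,
                             n_draws=200, eta=1e-6, seed=0):
    """Quantile at level p + eta of the bootstrap statistic T*_T(gamma), plus eta.

    data:   list of market-level observations w_t.
    rho:    function rho(w, theta, g) returning a scalar moment (k = 1 assumed).
    thetas: finite grid approximating {theta : gamma in Gamma(theta)}.
    kappa:  moment shrinking parameter kappa_T.
    """
    rng = random.Random(seed)
    T = len(data)

    def mean_moment(sample, theta, g):
        return sum(rho(w, theta, g) for w in sample) / len(sample)

    draws = []
    for _ in range(n_draws):
        boot = [data[rng.randrange(T)] for _ in range(T)]   # markets drawn with replacement
        best = math.inf
        for theta in thetas:
            total = 0.0
            for g, mu in zip(instruments, weights):
                rho_bar = mean_moment(data, theta, g)
                nu_star = math.sqrt(T) * (mean_moment(boot, theta, g) - rho_bar)
                # S from (D.11) applied to nu* + kappa^{1/2} * rho_bar.
                total += mu * min(nu_star + math.sqrt(kappa) * rho_bar, 0.0) ** 2
            best = min(best, total)
        draws.append(best)

    ordered = sorted(draws)
    idx = min(len(ordered) - 1, max(0, math.ceil((p + eta) * len(ordered)) - 1))
    return ordered[idx] + eta

# Toy usage with one moment inequality E[w - theta] >= 0 and a single instrument.
data = [0.0, 1.0, 2.0, 3.0]
cv_bt = bootstrap_critical_value(data, lambda w, th, g: w - th, thetas=[1.5],
                                 instruments=[1], weights=[1.0], kappa=2.0, p=0.95)
```

The term $\kappa_T^{1/2}\rho_T(\theta,g)$ pushes clearly slack inequalities out of the negative part, so they stop contributing to the bootstrap statistic, while binding inequalities (where $\rho_T$ is near zero) remain in play.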

Critical values that are not based on resampling are possible, too. For example, one can define a critical value similar to the bootstrap one, except with $\nu^{*}_{T}(\theta,g)$ replaced by a Gaussian process with covariance kernel that equals the sample covariance of $\rho(w_t,\theta^{(1)},g^{(1)})$ and $\rho(w_t,\theta^{(2)},g^{(2)})$ for $(\theta^{(j)},g^{(j)})\in\Theta\times\mathcal{G}$, $j = 1, 2$. For lack of space, we do not discuss such critical values in detail.

D.3 Coverage Probability

We show that the confidence sets defined in (D.5) using either $c^{sub}_{T}(\gamma,p)$ or $c^{bt}_{T}(\gamma,p)$ have asymptotically correct coverage probability uniformly over $\mathcal{H}_0$ under appropriate assumptions. The assumptions are given in Section D.4.

Theorem D.1. Suppose that Assumptions D.1-D.3 and D.5-D.7 hold. Then

(a) (D.7) holds with $c_T(\gamma,p) = c^{sub}_{T}(\gamma,p)$, and

(b) (D.7) holds with $c_T(\gamma,p) = c^{bt}_{T}(\gamma,p)$.

D.4 Assumptions

In this section, we list all the technical assumptions required for the profiling approach. The

assumptions are grouped into seven categories. Assumption D.1 restricts the space of θ. Assumption

D.2 restricts the space of (γ, F), i.e. the parameters that determine the true data generating

process. Assumption D.3 further restricts the space of (γ, F) to satisfy the null hypothesis γ ∈ Γ0.

Assumption D.4 is the full support condition on the measure µ on G. Assumption D.5 regulates how


$\mathcal{G}_T$ approaches $\mathcal{G}$ as $T$ increases. Assumption D.6 restricts the function $S(m,\Sigma)$ to satisfy certain continuity, monotonicity and convexity conditions. Assumption D.7 regulates the subsample size $b_T$ and the moment shrinking parameter $\kappa_T$ in the bootstrap procedure. Throughout, we let $E^{*}$ and $E_{*}$ denote outer and inner expectations respectively and $\Pr^{*}$ and $\Pr_{*}$ denote outer and inner probabilities.

Assumption D.1. (a) Θ is compact, (b) Γ is upper hemi-continuous, and (c) Γ−1(γ) is either

convex or empty for any γ ∈ Rdγ .

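To introduce Assumption D.2 we need the following extra notation. Let $\{\nu_F(\theta,g) : (\theta,g)\in\Theta\times\mathcal{G}\}$ denote a tight Gaussian process with covariance kernel

\[
\Sigma_F(\theta^{(1)},g^{(1)},\theta^{(2)},g^{(2)}) = \mathrm{Cov}_F\left(\rho(w_t,\theta^{(1)},g^{(1)}),\ \rho(w_t,\theta^{(2)},g^{(2)})\right).
\tag{D.16}
\]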

Notice that ΣF (θ, g) = ΣF (θ, g, θ, g).

Let the derivative of ρF (θ, g) with respect to θ be GF (θ, g).

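For any $\gamma\in R^{d_\gamma}$, let the set $\Theta_{0,F}(\gamma)$ be

\[
\Theta_{0,F}(\gamma) = \{\theta\in\Theta : Q_F(\theta) = 0\ \&\ \Gamma(\theta)\ni\gamma\}.
\tag{D.17}
\]

We call $\Theta_{0,F}(\gamma)$ the zero-set of $Q_F(\theta)$ under $(\gamma,F)$. Note that for any $\gamma\in R^{d_\gamma}$, $\gamma\in\Gamma_{0,F}$ if and only if $\Theta_{0,F}(\gamma)\neq\emptyset$.

Let the distance from a point to a set be the usual mapping:

\[
d(a,A) = \inf_{a^{*}\in A}\|a-a^{*}\|,
\tag{D.18}
\]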

where ‖ · ‖ is the Euclidean distance.

Let $\mathcal{F}$ denote the set of all probability measures on $(w_t)_{t=1}^T$. Let $\bar{\mathcal{G}} = \mathcal{G} \cup \{1\}$. Let $\mathcal{M}$ denote the set of all positive semi-definite $k \times k$ matrices. The following assumption defines the parameter space $\mathcal{H}$ for the pair $(\gamma, F)$.

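Assumption D.2. The parameter space $\mathcal{H}$ of the pairs $(\gamma, F)$ is a subset of $R^{d_\gamma} \times \mathcal{F}$ that satisfies:

(a) under every $F$ such that $(\gamma, F) \in \mathcal{H}$ for some $\gamma \in R^{d_\gamma}$, the markets are independent and ex ante identical to each other, i.e. $\{\rho(w_t, \theta, g)\}_{t=1}^T$ is an i.i.d. sample for any $\theta, g$;

(b) $\lim_{M\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} E^*_F\left[\sup_{(\theta,g)\in\Gamma^{-1}(\gamma)\times\mathcal{G}} \|\rho(w_t, \theta, g)\|^2\, 1\{\|\rho(w_t, \theta, g)\|^2 > M\}\right] = 0$;

(c) the class of functions $\{\rho(w_t, \theta, g) : (\theta, g) \in \Gamma^{-1}(\gamma) \times \mathcal{G}\}$ is $F$-Donsker and pre-Gaussian uniformly over $\mathcal{H}$;

(d) the class of functions $\{\rho(w_t, \theta, g)\rho(w_t, \theta, g)' : (\theta, g) \in \Gamma^{-1}(\gamma) \times \mathcal{G}\}$ is Glivenko-Cantelli uniformly over $\mathcal{H}$;

(e) $\rho_F(\theta, g)$ is differentiable with respect to $\theta \in \Theta$, and there exist constants $C$ and $\delta_1 > 0$ such that, for any $(\theta^{(1)}, \theta^{(2)})$, $\sup_{(\gamma,F)\in\mathcal{H}, g\in\mathcal{G}} \|vec(G_F(\theta^{(1)}, g)) - vec(G_F(\theta^{(2)}, g))\| \le C \times \|\theta^{(1)} - \theta^{(2)}\|^{\delta_1}$, and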


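(f) $\Sigma^\iota_F(\theta, g) \in \Psi$ for all $(\gamma, F) \in \mathcal{H}$ and $\theta \in \Gamma^{-1}(\gamma)$, where $\Psi$ is a compact subset of $\mathcal{M}$, and $\{vec(\Sigma_F(\cdot, g^{(1)}, \cdot, g^{(2)})) : (\Gamma^{-1}(\gamma))^2 \to R^{k^2} : (\gamma, F) \in \mathcal{H},\ g^{(1)}, g^{(2)} \in \mathcal{G}\}$ are uniformly bounded and uniformly equicontinuous.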

Remark. Part (a) is the i.i.d. assumption, which can be replaced with appropriate weak dependence conditions at the cost of a more complicated derivation of the uniform weak convergence of the bootstrap empirical process. Part (b) is a standard uniform Lindeberg condition. Parts (c)-(d) impose restrictions on the complexity of the set G as well as on the shape of ρ(wt, θ, g) as a function of θ. A sufficient condition is (i) ρ(wt, θ, g) is Lipschitz continuous in θ with the Lipschitz coefficient being integrable and (ii) the set C in the definition of G forms a Vapnik-Cervonenkis set and Jt is bounded. The Lipschitz continuity is also a sufficient condition for part (f).

The following assumption defines the null parameter space, H0, for the pair (γ, F ).

Assumption D.3. The null parameter space H0 is a subset of H that satisfies:

(a) for every (γ, F ) ∈ H0, γ ∈ Γ0,F , and

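(b) there exist $C, c > 0$ and $2 \le \delta_2 < 2(\delta_1 + 1)$ such that $Q_F(\theta) \ge C \cdot (d(\theta, \Theta_{0,F}(\gamma))^{\delta_2} \wedge c)$ for all $(\gamma, F) \in \mathcal{H}_0$ and $\theta \in \Gamma^{-1}(\gamma)$.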

Remark. Part (b) is an identification strength assumption. It requires the criterion function to increase at a certain minimum rate as θ is perturbed away from the identified set. This assumption is weaker than the quadratic minorant assumption in Chernozhukov, Hong, and Tamer (2007) if $\delta_2 > 2$ and as strong as the latter if $\delta_2 = 2$. Putting part (b) and Assumption D.2(e) together, we can see that there is a trade-off between the minimum identification strength required and the degree of Hölder continuity of the first derivative of $\rho_F(\cdot, g)$. If $\rho_F(\cdot, g)$ is linear, $\delta_2$ can be arbitrarily large, that is, the criterion function can increase very slowly as θ is perturbed away from the identified set.

The following assumption is on the measure µ. For any θ, let a pseudo-metric on $\mathcal{G}$ be: $\|g^{(1)} - g^{(2)}\|_{\theta,F} = \|\rho_{F,j}(\theta, g^{(1)}) - \rho_{F,j}(\theta, g^{(2)})\|$. This assumption is needed for Lemma D.1 and is not needed for the asymptotic size result, Theorem D.1.

Assumption D.4. For any θ ∈ Θ, µ(·) has full support on the metric space (G, || · ||θ,F ).

Remark. Assumption D.4 implies that for any θ ∈ Θ, F and j, if ρF,j(θ, g0) < 0 for some g0 ∈ G,

then there exists a neighborhood N (g0) with positive µ-measure such that ρF,j(θ, g) < 0 for all

g ∈ N (g0).

The following assumption is on the set GT .

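Assumption D.5. (a) $\mathcal{G}_T \uparrow \mathcal{G}$ as $T \to \infty$ and

(b) $\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\sqrt{T}\rho_F(\theta, g), \Sigma_F(\theta, g))\,d\mu(g) = 0$.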

The following assumptions are imposed on the function S. For a ξ > 0, let the ξ-expansion of Ψ be $\Psi^\xi = \{\Sigma \in \mathcal{M} : \inf_{\Sigma_1\in\Psi} \|vech(\Sigma) - vech(\Sigma_1)\| \le \xi\}$.


Assumption D.6. (a) $S(m, \Sigma) : (-\infty, \infty]^k \times \Psi^\xi \to R$ is continuous for some ξ > 0.

(b) There exists a constant $C > 0$ and $\xi > 0$ such that for any $m_1, m_2 \in R^k$ and $\Sigma_1, \Sigma_2 \in \Psi^\xi$, we have $|S(m_1, \Sigma_1) - S(m_2, \Sigma_2)| \le C\sqrt{(S(m_1, \Sigma_1) + S(m_2, \Sigma_2))(S(m_2, \Sigma_2) + 1)\Delta}$, where $\Delta = \|m_1 - m_2\|^2 + \|vech(\Sigma_1 - \Sigma_2)\|$.

(c) S is non-increasing in m.

(d) $S(m, \Sigma) \ge 0$ and $S(m, \Sigma) = 0$ if and only if $m \in [0, \infty]^k$.

(e) S is homogeneous in m of degree 2.

(f) S is convex in $m \in R^{d_m}$ for any $\Sigma \in \Psi^\xi$.

Remark. We show in the lemma below that Assumption D.6 is satisfied by the example in (D.11) (which is used in our empirical section) as well as the SUM and MAX functions in Andrews and Shi (2013):

$$\text{SUM: } S(m, \Sigma) = \sum_{j=1}^k [m_j/\sigma_j]_-^2, \text{ and}$$

$$\text{MAX: } S(m, \Sigma) = \max_{1\le j\le k} [m_j/\sigma_j]_-^2, \tag{D.19}$$

where $\sigma_j^2$ is the jth diagonal element of Σ. Assumptions D.6(b) and (f) rule out the QLR function in Andrews and Shi (2013): $S(m, \Sigma) = \min_{t\ge 0}(m - t)'\Sigma^{-1}(m - t)$.
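As a concrete illustration of (D.19), the following minimal Python sketch evaluates the SUM and MAX functions with NumPy. The names `s_sum`, `s_max` and `neg_part_sq` are hypothetical and chosen only for this sketch. It assumes that $[x]_-$ denotes the negative part of $x$, so that $[x]_-^2 = \min(x, 0)^2$, which is consistent with Assumption D.6(d).

```python
import numpy as np


def neg_part_sq(x):
    """Elementwise squared negative part, [x]_-^2 = min(x, 0)^2 (assumed notation)."""
    return np.minimum(x, 0.0) ** 2


def _standardize(m, sigma):
    """Divide each moment m_j by sigma_j, the square root of the jth diagonal of Sigma."""
    m = np.asarray(m, dtype=float)
    sd = np.sqrt(np.diag(np.asarray(sigma, dtype=float)))
    return m / sd


def s_sum(m, sigma):
    """SUM function in (D.19): sum over j of [m_j / sigma_j]_-^2."""
    return float(np.sum(neg_part_sq(_standardize(m, sigma))))


def s_max(m, sigma):
    """MAX function in (D.19): max over j of [m_j / sigma_j]_-^2."""
    return float(np.max(neg_part_sq(_standardize(m, sigma))))
```

Only the diagonal of Σ enters either function, which is why Lemma D.2(b) needs a lower bound on the diagonal elements but nothing about the off-diagonal ones.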

Lemma D.2. (a) Assumption D.6 is satisfied by the S function in (D.11) for any set Ψ.

(b) Assumption D.6 is satisfied by the SUM and the MAX functions in (D.19) if Ψ is a compact subset of the set of positive semi-definite matrices with diagonal elements bounded below by some constant $\xi_2 > 0$.

The following assumptions are imposed on the tuning parameters in the subsampling and the

bootstrap procedures.

Assumption D.7. (a) In the subsampling procedure, $b_T^{-1} + b_T T^{-1} \to 0$ and $S_T \to \infty$, and

(b) In the bootstrap procedure, $\kappa_T^{-1} + \kappa_T T^{-1} \to 0$ and $S_T \to \infty$.

D.5 Proof of Lemmas D.1 and D.2

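Proof of Lemma D.1. (a) Assumptions D.2(c)-(d) imply that under F,

$$\Delta_{\rho,T} \equiv \sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} \|\rho_T(\theta, g) - \rho_F(\theta, g)\| \to_p 0, \text{ and}$$

$$\sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} \|vech(\Sigma_T(\theta, g) - \Sigma_F(\theta, g))\| \to_p 0. \tag{D.20}$$

The second convergence implies that

$$\Delta_{\Sigma,T} \equiv \sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} \|vech(\Sigma^\iota_T(\theta, g) - \Sigma^\iota_F(\theta, g))\| \to_p 0. \tag{D.21}$$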


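By Assumption D.2(b), $\sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} \|\rho_F(\theta, g)\| < M^*$ for some $M^* < \infty$. Thus, $\{(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)) : (\theta, g) \in \Gamma^{-1}(\gamma) \times \mathcal{G}\}$ is a subset of the compact set $[-M^*, M^*]^k \times \Psi$. By Assumption D.2(f) and Equations (D.20) and (D.21), we have $\{(\rho_T(\theta, g), \Sigma^\iota_T(\theta, g)) : (\theta, g) \in \Gamma^{-1}(\gamma) \times \mathcal{G}\} \subseteq [-M^* - \xi, M^* + \xi]^k \times \Psi^\xi$ with probability approaching one for any ξ > 0. By Assumption D.6(a), $S(m, \Sigma)$ is uniformly continuous on $[-M^*, M^*]^k \times \Psi$. Therefore, for any ε > 0,

$$
\begin{aligned}
&\Pr_F\left(\left|\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta) - \min_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}_T} S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g)\right| > \varepsilon\right) \\
&\quad\le \Pr_F\left(\sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} |S(\rho_T(\theta, g), \Sigma^\iota_T(\theta, g)) - S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))| > \varepsilon\right) \to 0.
\end{aligned} \tag{D.22}
$$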

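It remains to show that $\min_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}_T} S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \to \min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta)$ as $T \to \infty$. Observe that

$$
\begin{aligned}
0 &\le \min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta) - \min_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}_T} S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \\
&\le \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \\
&\le \int_{\mathcal{G}\setminus\mathcal{G}_T} \sup_{\theta\in\Gamma^{-1}(\gamma)} S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g).
\end{aligned} \tag{D.23}
$$

We have $\sup_{\theta\in\Gamma^{-1}(\gamma)} S(\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)) < \infty$, because $\rho_F(\theta, g) \in [-M^*, M^*]^k$ and $\Sigma^\iota_F(\theta, g) \in \Psi$ and by Assumption D.6(a). Thus the last line of (D.23) converges to zero under Assumption D.5(a). This and (D.22) together show part (a).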

(b) The first half of part (b), minθ∈Γ−1(γ)QF (θ) ≥ 0, is implied by Assumption D.6(d).

Suppose γ ∈ Γ0,F . Then there exists a θ∗ ∈ Γ−1(γ) such that ρF (θ∗, g) ≥ 0 for all g ∈ G. This

implies that S(ρF (θ∗, g),ΣF (θ∗, g)) = 0 for all g ∈ G by Assumption D.6(d). Thus, QF (θ∗) = 0.

Because minθ∈Γ−1(γ)QF (θ) ≤ QF (θ∗) = 0, this shows the “if” part of the second half.

Suppose that minθ∈Γ−1(γ)QF (θ) = 0. By Assumptions D.1(a)-(b), Γ−1(γ) is compact. By

Assumptions D.2(e) and (f), QF (θ) is continuous in θ. Thus, there exists a θ∗ ∈ Γ−1(γ) such that

QF (θ∗) = minθ∈Γ−1(γ)QF (θ) = 0. We show by contradiction that this implies γ ∈ Γ0,F . Suppose

that γ /∈ Γ0,F . Then it must be that θ∗ /∈ Θ0,F , which implies that ρF,j(θ∗, g∗) < 0 for some g∗ ∈ G

and some j ≤ dm. Then by Assumption D.4, there exists a neighborhood N (g∗) with positive

µ-measure, such that ρF,j(θ∗, g) < 0 for all g ∈ N (g∗). This implies that QF (θ∗) > 0, which

contradicts QF (θ∗) = 0. Thus, the “only if” part is proved.

Proof of Lemma D.2. We prove part (b) only. Part (a) follows from the arguments for part (b)

because the S function in part (a) is the same as the SUM S function with Σ = I. Let ξ be any

positive number less than ξ2. Then the diagonal elements of all matrices in Ψξ are bounded below

by ξ2 − ξ.


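We prove the SUM part first. Assumptions D.6(a), (c)-(f) are immediate. It suffices to verify Assumption D.6(b). To verify Assumption D.6(b), observe that

$$
\begin{aligned}
|S(m_1, \Sigma_1) - S(m_2, \Sigma_2)|
&= \left|\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)([m_{1,j}/\sigma_{1,j}]_- + [m_{2,j}/\sigma_{2,j}]_-)\right| \\
&\le \left\{2\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2\,(S(m_1, \Sigma_1) + S(m_2, \Sigma_2))\right\}^{1/2} \\
&\equiv \left\{2A\,(S(m_1, \Sigma_1) + S(m_2, \Sigma_2))\right\}^{1/2},
\end{aligned} \tag{D.24}
$$

where the inequality holds by the Cauchy-Schwarz inequality and the inequality $(a + b)^2 \le 2(a^2 + b^2)$, and the ≡ holds with $A := \sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2$. Now we manipulate A in the following way:

$$
\begin{aligned}
A &= \sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{1,j}]_- + [m_{2,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2 \\
&\le 2\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{1,j}]_-)^2 + 2\sum_{j=1}^k ([m_{2,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{2,j}]_-)^2 \\
&= 2\sum_{j=1}^k ([m_{1,j}/\sigma_{1,j}]_- - [m_{2,j}/\sigma_{1,j}]_-)^2 + 2\sum_{j=1}^k (\sigma_{2,j} - \sigma_{1,j})^2 [m_{2,j}/\sigma_{2,j}]_-^2/\sigma_{1,j}^2 \\
&\le 2\|m_1 - m_2\|^2/(\xi_2 - \xi) + 2\|vech(\Sigma_1 - \Sigma_2)\|/(\xi_2 - \xi)\,S(m_2, \Sigma_2) \\
&\le 2(\xi_2 - \xi)^{-1}(S(m_2, \Sigma_2) + 1)(\|m_1 - m_2\|^2 + \|vech(\Sigma_1 - \Sigma_2)\|),
\end{aligned} \tag{D.25}
$$

where the first inequality holds by the inequality $(a + b)^2 \le 2(a^2 + b^2)$ and the second inequality holds because $(\sigma_{2,j} - \sigma_{1,j})^2 \le |\sigma_{2,j}^2 - \sigma_{1,j}^2| \le \|vech(\Sigma_1 - \Sigma_2)\|$ and because $\sigma_{1,j}^2, \sigma_{2,j}^2 \ge \xi_2 - \xi$. Plugging (D.25) into (D.24), we obtain Assumption D.6(b).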

The proof for the MAX part is the same as that for the SUM part except for some minor changes. The first and obvious change is to replace all $\sum_{j=1}^k$ involved in the above arguments by $\max_{j=1,\dots,k}$. The second change is to replace the Cauchy-Schwarz inequality used in (D.24) by the inequality $|\max_j a_j b_j| \le (\max_j a_j^2 \times \max_j b_j^2)^{1/2}$. The rest of the arguments stay unchanged.

E Proof of Theorem D.1

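We first introduce the approximation of $T_T(\gamma)$ that connects the distribution of $T_T(\gamma)$ with those of the subsampling statistic and the bootstrap statistic. For any $\theta \in \Theta_{0,F}(\gamma)$, let $\Lambda_T(\theta, \gamma) = \{\lambda : \theta + \lambda/\sqrt{T} \in \Gamma^{-1}(\gamma),\ d(\theta + \lambda/\sqrt{T}, \Theta_{0,F}(\gamma)) = \|\lambda\|/\sqrt{T}\}$. In words, $\Lambda_T(\theta, \gamma)$ is the set of all deviations from θ along the fastest paths away from $\Theta_{0,F}(\gamma)$. With this notation handy, we can define the approximation of $T_T(\gamma)$ as follows:

$$T^{appr}_T(\gamma) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_T(\theta,\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g). \tag{E.1}$$

Theorem E.1 shows that $T^{appr}_T(\gamma)$ approximates $T_T(\gamma)$ asymptotically.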

Theorem E.1. Suppose that Assumptions D.1-D.3 and D.5-D.6 hold. Then for any real sequence $\{x_T\}$ and scalar η > 0,

$$\liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr_F(T_T(\gamma) \le x_T + \eta) - \Pr(T^{appr}_T(\gamma) \le x_T)\right] \ge 0 \text{ and}$$

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr_F(T_T(\gamma) \le x_T) - \Pr(T^{appr}_T(\gamma) \le x_T + \eta)\right] \le 0.$$

Theorem E.1 is a key step in the proof of Theorem D.1 and is proved in the next subsection. The remaining proof of Theorem D.1 is given in the subsection after that.

E.1 Proof of Theorem E.1

The following lemma is used in the proof of Theorem E.1. It is a portmanteau theorem for uniform

weak approximation, which is an extension of the portmanteau theorem for (pointwise) weak con-

vergence in Chapter 1.3 of van der Vaart and Wellner (1996). Let (D, d) be a metric space and let

BL1 denote the set of all real functions on D with a Lipschitz norm bounded by one.

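Lemma E.1. (a) Let $(\Omega, \mathcal{B})$ be a measurable space. Let $X^{(1)}_T : \Omega \to D$ and $X^{(2)}_T : \Omega \to D$ be two sequences of mappings. Let $\mathcal{P}$ be a set of probability measures defined on $(\Omega, \mathcal{B})$. Suppose that $\sup_{P\in\mathcal{P}} \sup_{f\in BL_1} |E^*_P f(X^{(1)}_T) - E_{*,P} f(X^{(2)}_T)| \to 0$. Then for any open set $G_0 \subseteq D$ and closed set $G_1 \subset G_0$, we have

$$\liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \left[\Pr_{*,P}(X^{(1)}_T \in G_0) - \Pr^*_P(X^{(2)}_T \in G_1)\right] \ge 0, \text{ and}$$

(b) Let $(\Omega, \mathcal{B})$ be a product space: $(\Omega, \mathcal{B}) = (\Omega_1 \times \Omega_2, \sigma(\mathcal{B}_1 \times \mathcal{B}_2))$. Let $\mathcal{P}_1$ be a set of probability measures defined on $(\Omega_1, \mathcal{B}_1)$ and $P_2$ be a probability measure on $(\Omega_2, \mathcal{B}_2)$. Suppose that $\sup_{P_1\in\mathcal{P}_1} \Pr^*_{P_1}(\sup_{f\in BL_1} |E^*_{P_2} f(X^{(1)}_T) - E_{*,P_2} f(X^{(2)}_T)| > \varepsilon) \to 0$ for all ε > 0. Then for any open set $G_0 \subseteq D$ and closed set $G_1 \subset G_0$, we have for any ε > 0,

$$\limsup_{T\to\infty} \sup_{P_1\in\mathcal{P}_1} \Pr^*_{P_1}\left(\Pr^*_{P_2}(X^{(1)}_T \in G_1) - \Pr_{*,P_2}(X^{(2)}_T \in G_0) > \varepsilon\right) = 0.$$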

Proof of Lemma E.1. (a) We first show that there is a Lipschitz continuous function sandwiched by $1(x \in G_0)$ and $1(x \in G_1)$. Let $f_a(x) = (a \cdot d(x, G_0^c)) \wedge 1$, where $G_0^c$ is the complement of $G_0$. Then $f_a$ is a Lipschitz function and $f_a(x) \le 1(x \in G_0)$ for any a > 0. Because $G_1$ is a closed subset of $G_0$, $\inf_{x\in G_1} d(x, G_0^c) > c$ for some c > 0. Let $a = c^{-1} + 1$. Then $f_a(x) \ge 1(x \in G_1)$. Thus, the function $f_a(x)$ is sandwiched between $1(x \in G_0)$ and $1(x \in G_1)$. Equivalently,

$$a^{-1} 1(x \in G_1) \le a^{-1} f_a(x) \le a^{-1} 1(x \in G_0), \quad \forall x \in D. \tag{E.2}$$

By definition, $a^{-1} f_a(x) \in BL_1$. Using this fact and (E.2), we have

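$$
\begin{aligned}
&a^{-1} \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \left[\Pr_{*,P}(X^{(1)}_T \in G_0) - \Pr^*_P(X^{(2)}_T \in G_1)\right] \\
&\quad= \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \Big[a^{-1}\Pr_{*,P}(X^{(1)}_T \in G_0) - E_{*,P}\,a^{-1} f_a(X^{(1)}_T) \\
&\qquad\qquad + E_{*,P}\,a^{-1} f_a(X^{(1)}_T) - E^*_P\,a^{-1} f_a(X^{(2)}_T) + E^*_P\,a^{-1} f_a(X^{(2)}_T) - a^{-1}\Pr^*_P(X^{(2)}_T \in G_1)\Big] \\
&\quad\ge \liminf_{T\to\infty} \inf_{P\in\mathcal{P}} \left[E_{*,P}\,a^{-1} f_a(X^{(1)}_T) - E^*_P\,a^{-1} f_a(X^{(2)}_T)\right] \\
&\quad= 0.
\end{aligned} \tag{E.3}
$$

Therefore, part (a) is established.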
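The sandwich construction can be made concrete for the pair of sets used later in the proof of Theorem E.1, $G_0 = (-\infty, \eta)$ and $G_1 = (-\infty, 0]$. The sketch below is only an illustration on the real line; the function name `make_sandwich` and the choice $c = \eta/2$ are assumptions of this sketch (any $c$ strictly below $\inf_{x\in G_1} d(x, G_0^c) = \eta$ would do).

```python
def make_sandwich(eta):
    """Return (a, f_a) for G0 = (-inf, eta) and G1 = (-inf, 0], with eta > 0."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    c = eta / 2.0          # strictly smaller than inf_{x in G1} d(x, G0^c) = eta
    a = 1.0 / c + 1.0      # a = c^{-1} + 1, as in the proof

    def f_a(x):
        # d(x, G0^c) is the distance from x to [eta, inf)
        dist_to_complement = max(eta - x, 0.0)
        return min(a * dist_to_complement, 1.0)

    return a, f_a


a, f_a = make_sandwich(0.5)
```

Here `f_a` equals one on $G_1$, equals zero outside $G_0$, and is Lipschitz with constant `a`, so `f_a(x) / a` has Lipschitz norm bounded by one, which is the property used in (E.3).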

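(b) Using the same $a$ and $f_a(x)$ as above, we have

$$
\begin{aligned}
\Pr^*_{P_2}(X^{(1)}_T \in G_1) - \Pr_{*,P_2}(X^{(2)}_T \in G_0)
&\le a\left[E^*_{P_2}\,a^{-1} f_a(X^{(1)}_T) - E_{*,P_2}\,a^{-1} f_a(X^{(2)}_T)\right] \\
&\le a \sup_{f\in BL_1} |E^*_{P_2} f(X^{(1)}_T) - E_{*,P_2} f(X^{(2)}_T)|.
\end{aligned} \tag{E.4}
$$

This implies part (b).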

Proof of Theorem E.1. We only need to show the first inequality because the second one follows from the same arguments with $T_T(\gamma)$ and $T^{appr}_T(\gamma)$ flipped.

The proof consists of four steps. In the first step, we show that the truncation of $\mathcal{G}$ has an asymptotically negligible effect: for all ε > 0,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F(|T_T(\gamma) - \bar{T}_T(\gamma)| > \varepsilon) = 0, \tag{E.5}$$

where $\bar{T}_T(\gamma)$ is the same as $T_T(\gamma)$ except that the integral is over $\mathcal{G}$ instead of $\mathcal{G}_T$.

In the second step, we define a bounded version of $\bar{T}_T(\gamma)$: $T_T(\gamma; B_1, B_2)$ and a bounded version of $T^{appr}_T(\gamma)$: $T^{appr}_T(\gamma; B_1, B_2)$, and show that for any $B_1, B_2 > 0$ and any real sequence $\{x_T\}$,

$$\liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr_F(T_T(\gamma; B_1, B_2) \le x_T + \eta) - \Pr(T^{appr}_T(\gamma; B_1, B_2) \le x_T)\right] \ge 0. \tag{E.6}$$

In the third step, we show that $T_T(\gamma; B_1, B_2)$ is asymptotically close in distribution to $\bar{T}_T(\gamma)$ for large enough $B_1, B_2$: for any ε > 0, there exist $B_{1,\varepsilon}$ and $B_{2,\varepsilon}$ such that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F(T_T(\gamma; B_{1,\varepsilon}, B_{2,\varepsilon}) \neq \bar{T}_T(\gamma)) < \varepsilon. \tag{E.7}$$


In the fourth step, we show that $T^{appr}_T(\gamma; B_1, B_2)$ is asymptotically close in distribution to $T^{appr}_T(\gamma)$ for large enough $B_1, B_2$: for any ε > 0, there exist $B_{1,\varepsilon}$ and $B_{2,\varepsilon}$ such that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F(T^{appr}_T(\gamma; B_{1,\varepsilon}, B_{2,\varepsilon}) \neq T^{appr}_T(\gamma)) < \varepsilon. \tag{E.8}$$

The four steps combined prove the theorem. Now we give detailed arguments for the four steps.

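STEP 1. First we show a property of the function S that is useful throughout all steps: for any $(m_1, \Sigma_1)$ and $(m_2, \Sigma_2) \in R^k \times \Psi^\xi$,

$$|S(m_1, \Sigma_1) - S(m_2, \Sigma_2)| \le C^2 \times (S(m_2, \Sigma_2) + 1)(\Delta + \sqrt{\Delta^2 + 8\Delta})/2, \tag{E.9}$$

for the Δ and C in Assumption D.6(b). Let $\Delta_S := |S(m_1, \Sigma_1) - S(m_2, \Sigma_2)|$. Assumption D.6(b) implies that

$$
\begin{aligned}
\Delta_S^2 &\le C^2 \times (S(m_1, \Sigma_1) + S(m_2, \Sigma_2))(S(m_2, \Sigma_2) + 1)\Delta \\
&\le C^2 \times (\Delta_S + 2S(m_2, \Sigma_2))(S(m_2, \Sigma_2) + 1)\Delta.
\end{aligned} \tag{E.10}
$$

Solving the quadratic inequality for $\Delta_S$, we have

$$
\begin{aligned}
\Delta_S &\le \frac{C^2}{2}\left[(S(m_2, \Sigma_2) + 1)\Delta + \sqrt{(S(m_2, \Sigma_2) + 1)^2\Delta^2 + 8S(m_2, \Sigma_2)(S(m_2, \Sigma_2) + 1)\Delta}\right] \\
&\le \frac{C^2}{2}(S(m_2, \Sigma_2) + 1)(\Delta + \sqrt{\Delta^2 + 8\Delta}).
\end{aligned} \tag{E.11}
$$

This shows (E.9).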
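The passage from (E.10) to (E.11) can be spelled out as follows. The shorthand $s$ and $a$ is introduced here purely for illustration, and the sketch assumes $C \ge 1$, which seems to be implicit in (E.11); this is without loss of generality because the bound in Assumption D.6(b) continues to hold when $C$ is enlarged.

```latex
\begin{align*}
&\text{Let } s := S(m_2,\Sigma_2) \text{ and } a := C^2 (s+1)\Delta.
  \text{ Then (E.10) reads } \Delta_S^2 - a\,\Delta_S - 2as \le 0, \\
&\text{so } \Delta_S \text{ is at most the larger root: }
  \Delta_S \le \tfrac{1}{2}\left(a + \sqrt{a^2 + 8as}\right). \\
&\text{If } C \ge 1:\quad a^2 + 8as = C^4 (s+1)^2\Delta^2 + 8 C^2 s(s+1)\Delta
  \le C^4\left[(s+1)^2\Delta^2 + 8 s (s+1)\Delta\right], \\
&\text{and } s \le s+1 \text{ gives }
  (s+1)^2\Delta^2 + 8s(s+1)\Delta \le (s+1)^2\left(\Delta^2 + 8\Delta\right).
\end{align*}
```

Taking square roots in the last two lines yields the two inequalities displayed in (E.11).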


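Now observe that

$$
\begin{aligned}
0 &\le \bar{T}_T(\gamma) - T_T(\gamma) \\
&\le \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\sqrt{T}\rho_T(\theta, g), \Sigma^\iota_T(\theta, g))\,d\mu(g) \\
&\le \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} S(\sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \\
&\quad + \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} |S(\sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)) - S(\sqrt{T}\rho_T(\theta, g), \Sigma^\iota_T(\theta, g))|\,d\mu(g) \\
&= o(1) + \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} |S(\sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)) - S(\sqrt{T}\rho_T(\theta, g), \Sigma^\iota_T(\theta, g))|\,d\mu(g) \\
&\le o(1) + \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{\mathcal{G}\setminus\mathcal{G}_T} C^2 \times (S(\sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)) + 1)\,d\mu(g) \\
&\quad \times \sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}\setminus\mathcal{G}_T} c\left(\|\nu_T(\theta, g)\|^2 + \|vech(\Sigma^\iota_F(\theta, g) - \Sigma^\iota_T(\theta, g))\|\right) \\
&= o(1) + o(1) \times c(O_p(1)) \\
&= o_p(1),
\end{aligned} \tag{E.12}
$$

where $c(x) = (x + \sqrt{x^2 + 8x})/2$, the third inequality holds by the triangle inequality, the first equality holds by Assumption D.5(b), the fourth inequality holds by (E.9) and the second equality holds by Assumptions D.5(a)-(b) and D.2(c)-(d). The $o(1)$, $o_p(1)$ and $O_p(1)$ are uniform over $(\gamma, F) \in \mathcal{H}$. Thus, (E.5) is shown.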

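STEP 2. We define the bounded version of $\bar{T}_T(\gamma)$ as

$$
\begin{aligned}
T_T(\gamma; B_1, B_2) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_{\mathcal{G}} S\big(&\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{T}\rho_F(\theta, g), \\
&\Sigma^\iota_T(\theta + \lambda/\sqrt{T}, g)\big)\,d\mu(g),
\end{aligned} \tag{E.13}
$$

where $\Lambda^{B_2}_T(\theta, \gamma) = \{\lambda \in \Lambda_T(\theta, \gamma) : TQ_F(\theta + \lambda/\sqrt{T}) \le B_2\}$, the process $\nu^{B_1}_T(\cdot, \cdot) = \max\{-B_1, \min\{B_1, \nu_T(\cdot, \cdot)\}\}$ and $\theta_T$ is a value lying on the line segment joining $\theta$ and $\theta + \lambda/\sqrt{T}$ satisfying the mean value expansion:

$$\rho_F(\theta + \lambda/\sqrt{T}, g) = \rho_F(\theta, g) + G_F(\theta_T, g)\lambda/\sqrt{T}. \tag{E.14}$$

Define the bounded version of $T^{appr}_T(\gamma)$ as

$$T^{appr}_T(\gamma; B_1, B_2) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_{\mathcal{G}} S(\nu^{B_1}_F(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g), \tag{E.15}$$

where $\nu^{B_1}_F(\cdot, \cdot) = \max\{-B_1, \min\{B_1, \nu_F(\cdot, \cdot)\}\}$.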

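First we show a useful result: there exists some constant $C > 0$ such that for all $(\gamma, F) \in \mathcal{H}_0$ and $\lambda \in \Lambda^{B_2}_T(\theta, \gamma)$ and for the $\delta_2$ in Assumption D.3(b), we have

$$\|\lambda\| \le C \times T^{(\delta_2 - 2)/(2\delta_2)}. \tag{E.16}$$

This is shown by observing, for all $(\gamma, F) \in \mathcal{H}_0$ and $\lambda \in \Lambda^{B_2}_T(\theta, \gamma)$,

$$
\begin{aligned}
B_2 &\ge TQ_F(\theta + \lambda/\sqrt{T}) \\
&\ge C \cdot \left((T \times d(\theta + \lambda/\sqrt{T}, \Theta_{0,F}(\gamma))^{\delta_2}) \wedge (c \times T)\right).
\end{aligned} \tag{E.17}
$$

The second inequality holds by Assumption D.3(b). Because $c \times T$ is eventually greater than $B_2$ as $T \to \infty$, we have for large enough T,

$$B_2 \ge C \times T \times (\|\lambda\|/\sqrt{T})^{\delta_2}. \tag{E.18}$$

This implies (E.16).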

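Equation (E.16) implies two results:

$$(1)\quad \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma)} \sup_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \|\lambda\|/\sqrt{T} \le O(T^{-1/\delta_2}) = o(1)$$

$$
\begin{aligned}
(2)\quad &\sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma)} \sup_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \sup_{g\in\mathcal{G}} \|G_F(\theta + O(\|\lambda\|)/\sqrt{T}, g)\lambda - G_F(\theta, g)\lambda\| \\
&\quad\le O(1) \times \|\lambda\|^{\delta_1+1} T^{-\delta_1/2} \le O(T^{(\delta_2 - 2(\delta_1+1))/(2\delta_2)}) = o(1).
\end{aligned} \tag{E.19}
$$

The first result holds immediately given (E.16) and the second result holds by Assumption D.2(e).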
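For readers who want the exponent arithmetic behind (E.16) and (E.19), a worked version is sketched below. It only rearranges (E.18) and uses the Hölder bound of Assumption D.2(e); the constant $(B_2/C)^{1/\delta_2}$ is the explicit form of the generic constant in (E.16) under this rearrangement.

```latex
\begin{align*}
B_2 \ge C\,T\,(\|\lambda\|/\sqrt{T})^{\delta_2}
 &\;\Longrightarrow\; \|\lambda\|^{\delta_2} \le (B_2/C)\,T^{\delta_2/2 - 1}
 \;\Longrightarrow\; \|\lambda\| \le (B_2/C)^{1/\delta_2}\, T^{(\delta_2-2)/(2\delta_2)}, \\
\|\lambda\|/\sqrt{T}
 &= O\!\left(T^{(\delta_2-2)/(2\delta_2) - 1/2}\right) = O\!\left(T^{-1/\delta_2}\right), \\
\|\lambda\|\cdot\left(\|\lambda\|/\sqrt{T}\right)^{\delta_1}
 = \|\lambda\|^{\delta_1+1}\,T^{-\delta_1/2}
 &= O\!\left(T^{\left[(\delta_1+1)(\delta_2-2) - \delta_1\delta_2\right]/(2\delta_2)}\right)
  = O\!\left(T^{(\delta_2 - 2(\delta_1+1))/(2\delta_2)}\right).
\end{align*}
```

The last exponent is negative exactly because Assumption D.3(b) requires $\delta_2 < 2(\delta_1 + 1)$, which is why the second bound in (E.19) is $o(1)$.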

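Define an intermediate statistic

$$T^{med}_T(\gamma; B_1, B_2) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_{\mathcal{G}} S(\nu^{B_1}_T(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g). \tag{E.20}$$

Then $T^{med}_T(\gamma; B_1, B_2)$ and $T^{appr}_T(\gamma; B_1, B_2)$ are respectively the following functional evaluated at $\nu_T(\cdot, \cdot)$ and $\nu_F(\cdot, \cdot)$:

$$h(\nu) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_{\mathcal{G}} S(\nu^{B_1}(\theta, \cdot) + G_F(\theta, \cdot)\lambda + \sqrt{T}\rho_F(\theta, \cdot), \Sigma^\iota_F(\theta, \cdot))\,d\mu. \tag{E.21}$$

The functional $h(\nu)$ is uniformly bounded for all large enough T because for any fixed $\theta \in \Theta_{0,F}(\gamma)$ and $\lambda \in \Lambda^{B_2}_T(\theta, \gamma)$,

$$
\begin{aligned}
h(\nu) &\le 2\int_{\mathcal{G}} S(G_F(\theta, \cdot)\lambda + \sqrt{T}\rho_F(\theta, \cdot), \Sigma^\iota_F(\theta, \cdot))\,d\mu + 2\int_{\mathcal{G}} S(\nu^{B_1}(\theta, \cdot), \Sigma^\iota_F(\theta, \cdot))\,d\mu \\
&\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2\int_{\mathcal{G}} S(G_F(\theta, \cdot)\lambda + \sqrt{T}\rho_F(\theta, \cdot), \Sigma^\iota_F(\theta, \cdot))\,d\mu \\
&\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2T \times Q_F(\theta + \lambda/\sqrt{T}) \\
&\quad + C^2 \times (T \times Q_F(\theta + \lambda/\sqrt{T}) + 1) \sup_{g\in\mathcal{G}} \left(\Delta_T(g) + \sqrt{\Delta_T(g)^2 + 8\Delta_T(g)}\right) \\
&\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2B_2 + C^2(B_2 + 1) \times o(1),
\end{aligned} \tag{E.22}
$$

where $\Delta_T(g) := \|G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g) - \sqrt{T}\rho_F(\theta_T, g)\|^2 + \|vech(\Sigma^\iota_F(\theta, g) - \Sigma^\iota_F(\theta_T, g))\|$ and $\theta_T = \theta + \lambda/\sqrt{T}$. The first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumptions D.2(f) and D.6(c), the third inequality holds by (E.9) and the last inequality holds by (E.19).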
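The first inequality in (E.22) rests on an elementary consequence of Assumptions D.6(e)-(f), which may be worth writing out once. The generic vectors $m_1$, $m_2$, $x$ and $y$ below are notation introduced only for this sketch.

```latex
\begin{align*}
S(m_1 + m_2, \Sigma)
 &= 4\, S\!\left(\tfrac{1}{2} m_1 + \tfrac{1}{2} m_2, \Sigma\right)
   && \text{(homogeneity of degree 2, D.6(e))} \\
 &\le 4\left[\tfrac{1}{2} S(m_1,\Sigma) + \tfrac{1}{2} S(m_2,\Sigma)\right]
   && \text{(convexity in } m\text{, D.6(f))} \\
 &= 2\,S(m_1,\Sigma) + 2\,S(m_2,\Sigma).
\end{align*}
```

Taking $m_1 = G_F(\theta, \cdot)\lambda + \sqrt{T}\rho_F(\theta, \cdot)$ and $m_2 = \nu^{B_1}(\theta, \cdot)$ gives the first line of (E.22). Applying the same bound with $m_1 = x + y$ and $m_2 = -x$ and rearranging gives $S(x + y, \Sigma) \ge S(y, \Sigma)/2 - S(-x, \Sigma)$, which is the form in which the first inequalities of (E.37) and (E.45) are stated.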

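The functional $h(\nu)$ is Lipschitz continuous for all large enough T with respect to the uniform metric because

$$
\begin{aligned}
|h(\nu_1) - h(\nu_2)| &\le 2C \sup_{\theta\in\Theta_{0,F}(\gamma)} \sup_{\lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \sup_{g\in\mathcal{G}} \|\nu_1(\theta, g) - \nu_2(\theta, g)\| \cdot (1 + h(\nu_1) + 2h(\nu_2)) \\
&\le \bar{C} \sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} \|\nu_1(\theta, g) - \nu_2(\theta, g)\|,
\end{aligned} \tag{E.23}
$$

where $\bar{C}$ is any constant such that $\bar{C} > 2C \times (6\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 6B_2 + 1)$, the first inequality holds by Assumption D.6(b) and the second holds by (E.22).

Therefore, for any $f \in BL_1$ and any real sequence $\{x_T\}$, the composite function $f(\bar{C}^{-1} h(\cdot) + x_T) \in BL_1$. By Assumption D.2(c), we have

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{f\in BL_1} |E_F f(T^{med}_T(\gamma; B_1, B_2) + x_T) - E f(T^{appr}_T(\gamma; B_1, B_2) + x_T)| = 0. \tag{E.24}$$

This combined with Lemma E.1(a) (with $G_0 = (-\infty, \eta)$ and $G_1 = (-\infty, 0]$) gives

$$\liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[\Pr_F(T^{med}_T(\gamma; B_1, B_2) \le x_T + \eta) - \Pr(T^{appr}_T(\gamma; B_1, B_2) \le x_T)\right] \ge 0. \tag{E.25}$$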


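It remains to show that $T^{med}_T(\gamma; B_1, B_2)$ and $T_T(\gamma; B_1, B_2)$ are close. First, we have

$$
\begin{aligned}
&|T_T(\gamma; B_1, B_2) - T^{med}_T(\gamma; B_1, B_2)| \\
&\quad\le \sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_{\mathcal{G}} \Big|S(\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_T(\theta + \lambda/\sqrt{T}, g)) \\
&\qquad\qquad - S(\nu^{B_1}_T(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\Big|\,d\mu(g) \\
&\quad\le C^2 \times \sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \max_{g\in\mathcal{G}} c(\Delta_T(\theta, \lambda, g)) \times \int_{\mathcal{G}} (1 + M_T(\theta, \lambda, g))\,d\mu(g),
\end{aligned} \tag{E.26}
$$

where $c(x) = (x + \sqrt{x^2 + 8x})/2$, C is the constant in (E.9),

$$
\begin{aligned}
\Delta_T(\theta, \lambda, g) &= \|\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) - \nu^{B_1}_T(\theta, g) + G_F(\theta_T, g)\lambda - G_F(\theta, g)\lambda\|^2 \\
&\quad + \|vech(\Sigma_T(\theta + \lambda/\sqrt{T}, g) - \Sigma_F(\theta, g))\| \text{ and} \\
M_T(\theta, \lambda, g) &= S(\nu^{B_1}_T(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)).
\end{aligned} \tag{E.27}
$$

Below we show that for any ε > 0, and some universal constant $C > 0$,

$$\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(\sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma), g\in\mathcal{G}} \Delta_T(\theta, \lambda, g) > \varepsilon\right) \to 0 \text{ and} \tag{E.28}$$

$$\sup_T \sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_{\mathcal{G}} M_T(\theta, \lambda, g)\,d\mu(g) < C. \tag{E.29}$$

Once (E.28) and (E.29) are shown, it is immediate that for any ε > 0,

$$\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(|T_T(\gamma; B_1, B_2) - T^{med}_T(\gamma; B_1, B_2)| > \varepsilon\right) \to 0. \tag{E.30}$$

This combined with (E.25) shows (E.6).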

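Now we show (E.28) and (E.29). The convergence result (E.28) is implied by the following results: for any ε > 0,

$$
\begin{aligned}
&\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(\sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma), g\in\mathcal{G}} \|\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) - \nu^{B_1}_T(\theta, g)\| > \varepsilon\right) \to 0, \\
&\sup_{(\gamma,F)\in\mathcal{H}_0} \sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma), g\in\mathcal{G}} \|G_F(\theta_T, g)\lambda - G_F(\theta, g)\lambda\| \to 0 \text{ and} \\
&\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(\sup_{\theta\in\Theta_{0,F}(\gamma), \lambda\in\Lambda^{B_2}_T(\theta,\gamma), g\in\mathcal{G}} \|vech(\Sigma_T(\theta + \lambda/\sqrt{T}, g) - \Sigma_F(\theta, g))\| > \varepsilon\right) \to 0.
\end{aligned} \tag{E.31}
$$

The first result in the above display holds by the first result in equation (E.19) and the uniform stochastic equicontinuity of the empirical process $\nu_T(\cdot, g) : \Gamma^{-1}(\gamma) \to R^{d_m}$ with respect to the Euclidean metric. The uniform equicontinuity is implied by Assumptions D.2(b), (c) and (f) by Theorem 2.8.2 of van der Vaart and Wellner (1996). The second result in the above display holds by the second result in (E.19). The third result in (E.31) holds by Assumptions D.2(d) and (f).

Result (E.29) holds because for any $\theta \in \Theta_{0,F}(\gamma)$ and $\lambda \in \Lambda^{B_2}_T(\theta, \gamma)$,

$$
\begin{aligned}
\int_{\mathcal{G}} M_T(\theta, \lambda, g)\,d\mu(g)
&\le 2\int_{\mathcal{G}} S(\nu^{B_1}_T(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) + 2\int_{\mathcal{G}} S(G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \\
&\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2\int_{\mathcal{G}} S(G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \\
&\le 2\sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2B_2 + C^2(B_2 + 1) \times o(1),
\end{aligned} \tag{E.32}
$$

where the first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumption D.6(c), the last inequality holds by the third and fourth inequalities in (E.22), and the $o(1)$ is uniform over $(\theta, \lambda)$.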

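STEP 3. In order to show (E.7), first extend the definition of $T_T(\gamma; B_1, B_2)$ from Step 2 to allow $B_1$ and $B_2$ to take the value ∞ and observe that $T_T(\gamma; \infty, \infty) = \bar{T}_T(\gamma)$.

Assumption D.2(c) and Lemma E.1 imply that for any ε > 0, there exists $B_{1,\varepsilon}$ large enough such that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(\sup_{\theta\in\Theta, g\in\mathcal{G}} \|\nu_T(\theta, g)\| > B_{1,\varepsilon}\right) < \varepsilon. \tag{E.33}$$

Therefore we have for all $B_2$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(T_T(\gamma; \infty, B_2) \neq T_T(\gamma; B_{1,\varepsilon}, B_2)\right) < \varepsilon. \tag{E.34}$$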

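To show that $\bar{T}_T(\gamma)$ and $T_T(\gamma; \infty, B_2)$ are close for $B_2$ large enough, first observe that:

$$
\begin{aligned}
\bar{T}_T(\gamma) &\le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_T(\theta, g) + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_T(\theta, g))\,d\mu(g) \\
&\le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_T(\theta, g), \Sigma^\iota_T(\theta, g))\,d\mu(g) \\
&= O_p(1),
\end{aligned} \tag{E.35}
$$

where the first inequality holds because $0 \in \Lambda_T(\theta, \gamma)$, the second inequality holds because $\rho_F(\theta, g) \ge 0$ for $\theta \in \Theta_{0,F}(\gamma)$ and by Assumption D.6(c), and the equality holds by Assumptions D.6(a)-(c) and D.2(c), (d) and (f). The $O_p(1)$ is uniform over $(\gamma, F) \in \mathcal{H}_0$.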

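For any $T$, $\gamma$, $B_2$, if $\bar{T}_T(\gamma) \neq T_T(\gamma; \infty, B_2)$, then there must be a $\theta^* \in \Gamma^{-1}(\gamma)$ such that $T \times Q_F(\theta^*) > B_2$ and

$$\int_{\mathcal{G}} S(\nu_T(\theta^*, g) + \sqrt{T}\rho_F(\theta^*, g), \Sigma^\iota_T(\theta^*, g))\,d\mu(g) < O_p(1). \tag{E.36}$$

But

$$
\begin{aligned}
&\int_{\mathcal{G}} S(\nu_T(\theta^*, g) + \sqrt{T}\rho_F(\theta^*, g), \Sigma^\iota_T(\theta^*, g))\,d\mu(g) \\
&\quad\ge 2^{-1}\int_{\mathcal{G}} S(\sqrt{T}\rho_F(\theta^*, g), \Sigma^\iota_T(\theta^*, g))\,d\mu(g) - \int_{\mathcal{G}} S(-\nu_T(\theta^*, g), \Sigma^\iota_T(\theta^*, g))\,d\mu(g) \\
&\quad\ge 2^{-1}\int_{\mathcal{G}} S(\sqrt{T}\rho_F(\theta^*, g), \Sigma^\iota_T(\theta^*, g))\,d\mu(g) - O_p(1) \\
&\quad\ge 2^{-1}\left[TQ_F(\theta^*) - \int_{\mathcal{G}} |S(\sqrt{T}\rho_F(\theta^*, \cdot), \Sigma^\iota_T(\theta^*, \cdot)) - S(\sqrt{T}\rho_F(\theta^*, \cdot), \Sigma^\iota_F(\theta^*, \cdot))|\,d\mu\right] - O_p(1) \\
&\quad\ge 2^{-1}\left[TQ_F(\theta^*) - C^2 \sup_{g\in\mathcal{G}} c(\|vech(\Sigma^\iota_T(\theta^*, g) - \Sigma^\iota_F(\theta^*, g))\|) \times (1 + TQ_F(\theta^*))\right] - O_p(1) \\
&\quad= B_2/2 - o(1) - o_p(1) \times C^2 \times B_2/4 - O_p(1),
\end{aligned} \tag{E.37}
$$

where $c(x) = (x + \sqrt{x^2 + 8x})/2$ and C is the constant in (E.9). The first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumption D.6(c) and Assumptions D.2(c)-(d) and (f), the third inequality holds by the triangle inequality, the fourth inequality holds by (E.9) and the equality holds by Assumption D.2(d). The $o(1)$, $o_p(1)$ and $O_p(1)$ terms are uniform over $\theta^* \in \Gamma^{-1}(\gamma)$ and $(\gamma, F) \in \mathcal{H}_0$.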

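Then

$$
\begin{aligned}
\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(\bar{T}_T(\gamma) \neq T_T(\gamma; \infty, B_2)\right)
&\le \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(2^{-1}(1 - o_p(1)) \times B_2 - o(1) - O_p(1) \le O_p(1)\right) \\
&= \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F(O_p(1) \ge B_2),
\end{aligned} \tag{E.38}
$$

where the first inequality holds by (E.36) and (E.37). Then for any ε, there exists $B_{2,\varepsilon}$ such that

$$\lim_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F(\bar{T}_T(\gamma) \neq T_T(\gamma; \infty, B_{2,\varepsilon})) < \varepsilon. \tag{E.39}$$

Combining this with (E.34), we have (E.7).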

STEP 4. In order to show (E.8), first extend the definition of $T^{appr}_T(\gamma; B_1, B_2)$ from Step 2 to allow $B_1$ and $B_2$ to take the value ∞ and observe that $T^{appr}_T(\gamma; \infty, \infty) = T^{appr}_T(\gamma)$.

By the same arguments as those for (E.34), for any ε and $B_2$, there exists $B_{1,\varepsilon}$ large enough so that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_F\left(T^{appr}_T(\gamma; \infty, B_2) \neq T^{appr}_T(\gamma; B_{1,\varepsilon}, B_2)\right) < \varepsilon. \tag{E.40}$$

Also by the same reasons as those for (E.35), we have

$$T^{appr}_T(\gamma) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g), \tag{E.41}$$

where the right hand side is a real-valued random variable.

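For any $T$ and $B_2$, if $T^{appr}_T(\gamma) \neq T^{appr}_T(\gamma; \infty, B_2)$, then there must be a $\theta^* \in \Theta_{0,F}(\gamma)$ and a $\lambda^{**} \in \{\lambda \in \Lambda_T(\theta^*, \gamma) : T \times Q_F(\theta^* + \lambda/\sqrt{T}) > B_2\}$ such that

$$I(\lambda^{**}) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g), \tag{E.42}$$

where $I(\lambda) = \int_{\mathcal{G}} S(\nu_F(\theta^*, g) + G_F(\theta^*, g)\lambda + \sqrt{T}\rho_F(\theta^*, g), \Sigma^\iota_F(\theta^*, g))\,d\mu(g)$. Next we show that if $\lambda^{**}$ exists, then there must exist a $\lambda^*$ such that

$$
\begin{aligned}
&\lambda^* \in \{\lambda \in \Lambda_T(\theta^*, \gamma) : T \times Q_F(\theta^* + \lambda/\sqrt{T}) \in (B_2, 2B_2]\} \text{ and} \\
&I(\lambda^*) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g).
\end{aligned} \tag{E.43}
$$

If $T \times Q_F(\theta^* + \lambda^{**}/\sqrt{T}) \in (B_2, 2B_2]$, then we are done. If $T \times Q_F(\theta^* + \lambda^{**}/\sqrt{T}) > 2B_2$, there must be an $a^* \in (0, 1)$ such that $T \times Q_F(\theta^* + a^*\lambda^{**}/\sqrt{T}) \in (B_2, 2B_2]$ because $TQ_F(\theta^* + 0 \times \lambda^{**}/\sqrt{T}) = 0$ and $TQ_F(\theta^* + a\lambda^{**}/\sqrt{T})$ is continuous in $a$ (by Assumptions D.2(e) and D.6(a)). By Assumption D.6(f), $I(\lambda)$ is convex. Thus $I(a^*\lambda^{**}) \le a^* I(\lambda^{**}) + (1 - a^*) I(0)$. By the same arguments as those for (E.35), $I(0) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g)$. Thus, $I(a^*\lambda^{**}) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g)$. Assumption D.1(c) and the definition of $\Lambda_T(\theta, \gamma)$ guarantee that $a^*\lambda^{**} \in \Lambda_T(\theta^*, \gamma)$. Therefore, $\lambda^* = a^*\lambda^{**}$ satisfies (E.43).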

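Similar to (E.19) we have

$$
\begin{aligned}
(1)\quad &\|\lambda^*\|/\sqrt{T} \le B_2 \times 2C \times T^{-1/\delta_2} = B_2 \times o(1) \\
(2)\quad &\sup_{g\in\mathcal{G}} \|G_F(\theta^* + O(\|\lambda^*\|)/\sqrt{T}, g)\lambda^* - G_F(\theta^*, g)\lambda^*\| \\
&\quad\le O(1) \times B_2^{(\delta_1+1)/\delta_2} \|\lambda\|^{\delta_1+1} T^{-\delta_1/2} = B_2^{(\delta_1+1)/\delta_2} o(1),
\end{aligned} \tag{E.44}
$$

where the $o(1)$ terms do not depend on $B_2$. Then,

$$
\begin{aligned}
I(\lambda^*) &\ge 2^{-1}\int_{\mathcal{G}} S(G_F(\theta^*, g)\lambda^* + \sqrt{T}\rho_F(\theta^*, g), \Sigma^\iota_F(\theta^*, g))\,d\mu(g) - \int_{\mathcal{G}} S(-\nu_F(\theta^*, g), \Sigma^\iota_F(\theta^*, g))\,d\mu(g) \\
&\ge TQ_F(\theta^* + \lambda^*/\sqrt{T})/2 - C^2 \times (TQ_F(\theta^* + \lambda^*/\sqrt{T}) + 1) \times c(\Delta_T)/2 + O_p(1) \\
&= TQ_F(\theta^* + \lambda^*/\sqrt{T})/2 - C^2 \times (2B_2 + 1) \times c(\Delta_T)/4 + O_p(1),
\end{aligned} \tag{E.45}
$$

where the $O_p(1)$ term is uniform over $(\gamma, F) \in \mathcal{H}_0$, $c(x) = (x + \sqrt{x^2 + 8x})/2$ and

$$
\begin{aligned}
\Delta_T := {}&\|G_F(\theta^*, g)\lambda^* + \sqrt{T}\rho_F(\theta^*, g) - \sqrt{T}\rho_F(\theta^* + \lambda^*/\sqrt{T}, g)\|^2 \\
&+ \|vech(\Sigma^\iota_F(\theta^* + \lambda^*/\sqrt{T}, g) - \Sigma^\iota_F(\theta^*, g))\|.
\end{aligned} \tag{E.46}
$$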

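The first inequality in (E.45) holds by Assumptions D.6(e)-(f), the second inequality holds by (E.9) and the equality holds by (E.43). By (E.44) and Assumption D.2(f), for any fixed $B_2$, $\lim_{T\to\infty} \Delta_T = 0$. Therefore, for each fixed $B_2$,

$$I(\lambda^*) \ge TQ_F(\theta^* + \lambda^*/\sqrt{T})/2 - O_p(1) \ge B_2/2 - O_p(1). \tag{E.47}$$

Thus

$$
\begin{aligned}
&\sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(T^{appr}_T(\gamma) \neq T^{appr}_T(\gamma; \infty, B_2)) \\
&\quad\le \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr\left(\sup_{\theta\in\Theta_{0,F}(\gamma)} \int_{\mathcal{G}} S(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g))\,d\mu(g) \ge B_2/2 - O_p(1)\right) \\
&\quad= \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(O_p(1) \ge B_2).
\end{aligned} \tag{E.48}
$$

For any ε > 0, there exists $B_{2,\varepsilon}$ large enough so that $\lim_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(O_p(1) \ge B_{2,\varepsilon}) < \varepsilon$. Thus,

$$\lim_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr(T^{appr}_T(\gamma) \neq T^{appr}_T(\gamma; \infty, B_{2,\varepsilon})) < \varepsilon. \tag{E.49}$$

Combining this with (E.40), we have (E.8).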

E.2 Proof of Theorem D.1

The following lemma is used in the proof of Theorem D.1. It shows the convergence of the bootstrap empirical process $\nu^*_T(\theta, g)$. Let $W_{T,t}$ be the number of times that the tth observation appears in a bootstrap sample. Then $(W_{T,1}, \dots, W_{T,T})$ is a random draw from a multinomial distribution with parameters $T$ and $(T^{-1}, \dots, T^{-1})$, and $\nu^*_T(\theta, g)$ can be written as

$$\nu^*_T(\theta, g) = T^{-1/2} \sum_{t=1}^T (W_{T,t} - 1)\rho(w_t, \theta, g). \tag{E.50}$$

In the lemma, the subscripts F and W for E and Pr signify the fact that the expectation and the probabilities are taken with respect to the randomness in the data and the randomness in $\{W_{T,t}\}$ respectively.
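The representation (E.50) is straightforward to compute. The following minimal sketch draws the multinomial weights with NumPy and evaluates the bootstrap process at one fixed $(\theta, g)$. The function names `bootstrap_weights` and `bootstrap_process` are hypothetical, and the array `rho` holds simulated stand-in values for $\rho(w_t, \theta, g)$ rather than moments from any real data.

```python
import numpy as np


def bootstrap_weights(T, rng):
    """Draw (W_{T,1}, ..., W_{T,T}) from Multinomial(T; 1/T, ..., 1/T)."""
    return rng.multinomial(T, np.full(T, 1.0 / T))


def bootstrap_process(rho, weights):
    """Evaluate T^{-1/2} * sum_t (W_{T,t} - 1) * rho_t as in (E.50).

    rho has shape (T, k): row t holds rho(w_t, theta, g) at one fixed (theta, g).
    """
    T = rho.shape[0]
    return (weights - 1) @ rho / np.sqrt(T)


rng = np.random.default_rng(0)
rho = rng.normal(size=(200, 3))   # simulated moment values, for illustration only
W = bootstrap_weights(200, rng)
nu_star = bootstrap_process(rho, W)
```

Because the weights sum to $T$, `nu_star` equals $\sqrt{T}$ times the difference between the resampled mean and the original sample mean of the rows of `rho`, so no explicit resampling of observations is needed.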

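Lemma E.2. Suppose that Assumption D.2 holds. Then for any ε > 0,

(a) $\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr^*_F\left(\sup_{f\in BL_1} |E_W f(\nu^*_T(\cdot, \cdot)) - E f(\nu_F(\cdot, \cdot))| > \varepsilon\right) = 0$,

(b) there exists $B_\varepsilon$ large enough such that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr^*_F\left(\Pr_W\left(\sup_{\theta\in\Gamma^{-1}(\gamma), g\in\mathcal{G}} \|\nu^*_T(\theta, g)\| > B_\varepsilon\right) > \varepsilon\right) = 0, \text{ and}$$

(c) there exists $\delta_\varepsilon$ small enough such that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr^*_F\left(\Pr_W\left(\sup_{g\in\mathcal{G}} \sup_{\|\theta^{(1)} - \theta^{(2)}\| \le \delta_\varepsilon} \|\nu^*_T(\theta^{(1)}, g) - \nu^*_T(\theta^{(2)}, g)\| > \varepsilon\right) > \varepsilon\right) = 0.$$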

Proof of Lemma E.2. (a) Part (a) is proved using a combination of the arguments in Theorem 2.9.6 and Theorem 3.6.1 in van der Vaart and Wellner (1996). Take a Poisson number $N_T$ with mean $T$ and independent of the original sample. Then $W_{N_T,1}, \dots, W_{N_T,T}$ are i.i.d. Poisson variables with mean one. Let the Poissonized version of $\nu^*_T(\theta, g)$ be

$$\nu^{poi}_T(\theta, g) = T^{-1/2} \sum_{t=1}^T (W_{N_T,t} - 1)\rho(w_t, \theta, g). \tag{E.51}$$
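The Poissonization step can be checked by simulation: if the number of bootstrap draws is itself Poisson with mean $T$, the resulting cell counts behave like independent Poisson variables with mean one. The sketch below is only a numerical sanity check under simulated draws; the function name `poissonized_weights` and the sizes `T` and `reps` are arbitrary choices made for this illustration.

```python
import numpy as np


def poissonized_weights(T, rng):
    """Draw N_T ~ Poisson(T), then allocate N_T draws uniformly over T cells."""
    n_draws = rng.poisson(T)
    return rng.multinomial(n_draws, np.full(T, 1.0 / T))


rng = np.random.default_rng(0)
T, reps = 50, 4000
draws = np.array([poissonized_weights(T, rng) for _ in range(reps)])

mean_w = draws.mean()                               # close to 1 for Poisson(1)
var_w = draws.var()                                 # close to 1 for Poisson(1)
cov_01 = np.cov(draws[:, 0], draws[:, 1])[0, 1]     # close to 0 under independence
```

With a fixed number of draws $T$ the multinomial counts are negatively correlated; randomizing the total removes this dependence, which is what makes the multiplier central limit theorem applicable.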

Theorem 2.9.6 in van der Vaart and Wellner (1996) is a multiplier central limit theorem that shows that if $\{\rho(w_t, \theta, g) : (\theta, g) \in \Theta \times \mathcal{G}\}$ is $F$-Donsker and pre-Gaussian, then $\nu^{poi}_T(\theta, g)$ converges weakly to $\nu_F(\theta, g)$ conditional on the data in outer probability. The arguments of Theorem 2.9.6 remain valid if we strengthen the $F$-Donsker and pre-Gaussian condition to the uniform Donsker and pre-Gaussian condition of Assumption D.2(c) and strengthen the conclusion to uniform weak convergence:

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr^*_F\left(\sup_{f\in BL_1} |E_W f(\nu^{poi}_T(\cdot, \cdot)) - E f(\nu_F(\cdot, \cdot))| > \varepsilon\right) = 0. \tag{E.52}$$

In particular, the extension to the uniform versions of the first and the third displays in the proof of Theorem 2.9.6 in van der Vaart and Wellner (1996) is straightforward. To extend the second display, we only need to replace Lemma 2.9.5 with Proposition A.5.2, a uniform central limit theorem for finite dimensional vectors.


Theorem 3.6.1 in van der Vaart and Wellner (1996) shows that, under a fixed $(\gamma, F)$, the bounded Lipschitz distance between $\nu^{poi}_T(\theta, g)$ and $\nu^*_T(\theta, g)$ converges to zero conditional on (outer) almost all realizations of the data. The arguments remain valid if we strengthen the Glivenko-Cantelli assumption used there to uniform Glivenko-Cantelli (which is implied by Assumption D.2(c)) and strengthen the conclusion to: for all ε > 0,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}} \Pr^*_F\left(\sup_{f\in BL_1} |E_W f(\nu^{poi}_T(\cdot, \cdot)) - E_W f(\nu^*_T(\cdot, \cdot))| > \varepsilon\right) = 0. \tag{E.53}$$

Equations (E.52) and (E.53) together imply part (a).

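(b) Part (b) is implied by part (a), Lemma E.1(b) and the uniform pre-Gaussianity assumption (Assumption D.2(c)). When applying Lemma E.1(b), consider $X^{(1)}_T = \nu^*_T$, $X^{(2)}_T = \nu_F$, $G_1 = \{\nu : \sup_{\theta,g} \|\nu(\theta, g)\| \ge B_\varepsilon\}$, and $G_0 = \{\nu : \sup_{\theta,g} \|\nu(\theta, g)\| > B_\varepsilon - 1\}$, where $B_\varepsilon$ satisfies:

$$\sup_{(\gamma,F)\in\mathcal{H}} \Pr\left(\sup_{\theta\in\Theta, g\in\mathcal{G}} \|\nu_F(\theta, g)\| > B_\varepsilon - 1\right) < \varepsilon/2. \tag{E.54}$$

Such a $B_\varepsilon$ exists because $\{\rho(w_t, \theta, g) : (\theta, g) \in \Theta \times \mathcal{G}\}$ is uniformly pre-Gaussian by Assumption D.2(c).

(c) Part (c) is implied by part (a), Lemma E.1(b) and the uniform pre-Gaussianity assumption (Assumption D.2(c)). When applying Lemma E.1(b), consider $X^{(1)}_T = \nu^*_T$, $X^{(2)}_T = \nu_F$, $G_1 = \{\nu : \sup_{\|\theta^{(1)} - \theta^{(2)}\| \le \Delta_\varepsilon, g} \|\nu(\theta^{(1)}, g) - \nu(\theta^{(2)}, g)\| \ge \varepsilon\}$, and $G_0 = \{\nu : \sup_{\|\theta^{(1)} - \theta^{(2)}\| \le \Delta_\varepsilon, g} \|\nu(\theta^{(1)}, g) - \nu(\theta^{(2)}, g)\| > \varepsilon/2\}$, where $\Delta_\varepsilon$ satisfies:

$$\sup_{(\gamma,F)\in\mathcal{H}} \Pr\left(\sup_{\|\theta^{(1)} - \theta^{(2)}\| \le \Delta_\varepsilon, g} \|\nu_F(\theta^{(1)}, g) - \nu_F(\theta^{(2)}, g)\| > \varepsilon/2\right) < \varepsilon/2. \tag{E.55}$$

Such a $\Delta_\varepsilon$ exists because $\{\rho(w_t, \theta, g) : (\theta, g) \in \Theta \times \mathcal{G}\}$ is uniformly pre-Gaussian.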

Proof of Theorem D.1. (a) Let $q^{appr}_{b_T}(\gamma, p)$ denote the $p$ quantile of $T^{appr}_{b_T}(\gamma)$. Let $\eta_2 = \eta^*/3$. Below we show that

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \Pr_{F,sub}(c^{sub}_T(\gamma, p) \le q^{appr}_{b_T}(\gamma, p) + \eta_2) = 0, \tag{E.56}$$

where $\Pr_{F,sub}$ signifies the fact that there are two sources of randomness in $c^{sub}_T(\gamma, p)$, one from the original sampling and the other from the subsampling. Once (E.56) is established, we have

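$$\begin{aligned}
&\liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,\mathrm{sub}}\left( T_T(\gamma) \le c^{\mathrm{sub}}_T(\gamma, p) \right) \\
&\quad \ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_F\left( T_T(\gamma) \le q^{\mathrm{appr}}_{b_T}(\gamma, p) + \eta_2 \right) \\
&\quad \ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[ \mathrm{Pr}_F\left( T_T(\gamma) \le q^{\mathrm{appr}}_{b_T}(\gamma, p) + \eta_2 \right) - \mathrm{Pr}\left( T^{\mathrm{appr}}_T(\gamma) \le q^{\mathrm{appr}}_{b_T}(\gamma, p) \right) \right] \\
&\qquad + \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \left[ \mathrm{Pr}\left( T^{\mathrm{appr}}_T(\gamma) \le q^{\mathrm{appr}}_{b_T}(\gamma, p) \right) - \mathrm{Pr}\left( T^{\mathrm{appr}}_{b_T}(\gamma) \le q^{\mathrm{appr}}_{b_T}(\gamma, p) \right) \right] \\
&\qquad + \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}\left( T^{\mathrm{appr}}_{b_T}(\gamma) \le q^{\mathrm{appr}}_{b_T}(\gamma, p) \right) \\
&\quad \ge p,
\end{aligned} \tag{E.57}$$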

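where the first inequality holds by (E.56). The second inequality holds because the lim inf (of the infimum) of a sum is no smaller than the sum of the corresponding lim infs. The third inequality holds because the first two lim infs after the second inequality are greater than or equal to zero and the third is greater than or equal to $p$. The first lim inf is greater than or equal to zero by Theorem E.1. The second lim inf is greater than or equal to zero because $T^{\mathrm{appr}}_{b_T}(\gamma) \ge T^{\mathrm{appr}}_T(\gamma)$ for any $\gamma$ and $T$, which holds because $\sqrt{T} \ge \sqrt{b_T}$ and $\Lambda_{b_T}(\theta, \gamma) \subseteq \Lambda_T(\theta, \gamma)$ for large enough $T$ by Assumptions D.1(c) and D.7(c).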

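Now it is left to show (E.56). In order to show (E.56), we first show that the c.d.f. of $T^{\mathrm{appr}}_{b_T}(\gamma)$ is close to the following empirical distribution function:

$$\hat{L}_{T,b_T}(x; \gamma) = S_T^{-1} \sum_{s=1}^{S_T} 1\left( T^s_{T,b_T}(\gamma) \le x \right). \tag{E.58}$$

Define an intermediate quantity first:

$$\tilde{L}_{T,b_T}(x; \gamma) = q_T^{-1} \sum_{l=1}^{q_T} 1\left( T^l_{T,b_T}(\gamma) \le x \right), \tag{E.59}$$

where $q_T = \binom{T}{b_T}$ and $(T^l_{T,b_T}(\gamma))_{l=1}^{q_T}$ are the subsample statistics computed using all $q_T$ possible subsamples of size $b_T$ of the original sample. Conditional on the original sample, $(T^s_{T,b_T}(\gamma))_{s=1}^{S_T}$ are $S_T$ i.i.d. draws from $\tilde{L}_{T,b_T}(\cdot; \gamma)$. By the uniform Glivenko-Cantelli theorem, for any $\varepsilon > 0$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,\mathrm{sub}}\left( \sup_{x\in\mathbb{R}} \left| \hat{L}_{T,b_T}(x; \gamma) - \tilde{L}_{T,b_T}(x; \gamma) \right| > \varepsilon \right) = 0. \tag{E.60}$$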
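As a rough illustration of the empirical distribution function in (E.58), the sketch below draws $S_T$ independent random subsamples of size $b_T$, evaluates a statistic on each, and forms the resulting step function. The statistic `toy_stat`, all function names, the numerical values, and the rule used for the critical value (the smallest order statistic $x$ whose empirical c.d.f. value is at least $p + \eta^*$, shifted up by $\eta^*$) are assumptions made for illustration only; the rule is one reading of the property of $c^{\mathrm{sub}}_T(\gamma, p)$ invoked in this proof, not the paper's implementation.

```python
import math
import numpy as np

def toy_stat(sample):
    # Hypothetical statistic (NOT the paper's T_T): scaled positive part of the mean.
    return math.sqrt(len(sample)) * max(float(np.mean(sample)), 0.0)

def subsample_stats(data, b, S, rng, stat_fn=toy_stat):
    # S independent subsamples, each of size b drawn without replacement.
    T = len(data)
    return np.array([stat_fn(data[rng.choice(T, size=b, replace=False)])
                     for _ in range(S)])

def empirical_cdf(stats, x):
    # S^{-1} * sum_s 1(stat_s <= x), the form of (E.58).
    return float(np.mean(stats <= x))

def subsample_quantile(stats, level):
    # Smallest order statistic x with empirical_cdf(stats, x) >= level.
    S = len(stats)
    k = min(max(math.ceil(level * S), 1), S)
    return float(np.sort(stats)[k - 1])

rng = np.random.default_rng(0)
data = rng.normal(size=500)            # simulated data, for illustration only
p, eta_star = 0.75, 0.0625             # illustrative values only
stats = subsample_stats(data, b=50, S=160, rng=rng)
q = subsample_quantile(stats, p + eta_star)
c_sub = q + eta_star                   # assumed form of the critical value
print(empirical_cdf(stats, q) >= p + eta_star)
```

By construction, the empirical c.d.f. evaluated at `c_sub - eta_star` (that is, at `q`) is at least `p + eta_star`, which is the inequality the proof relies on.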

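It is implied by Hoeffding's inequality for U-statistics (Theorem A on page 201 of Serfling (1980a)) that for any real sequence $x_T$ and $\varepsilon > 0$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_F\left( \tilde{L}_{T,b_T}(x_T; \gamma) - \mathrm{Pr}_F\left( T^l_{T,b_T}(\gamma) \le x_T \right) > \varepsilon \right) = 0. \tag{E.61}$$

Equations (E.60) and (E.61) imply that, for any real sequence $x_T$ and $\varepsilon > 0$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,\mathrm{sub}}\left( \hat{L}_{T,b_T}(x_T; \gamma) - \mathrm{Pr}_F\left( T^l_{T,b_T}(\gamma) \le x_T \right) > \varepsilon \right) = 0. \tag{E.62}$$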

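Applying Theorem E.1 to the subsample statistic $T^l_{T,b_T}(\gamma)$, we have for any $\varepsilon > 0$ and any real sequence $x_T$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \left[ \mathrm{Pr}_F\left( T^l_{T,b_T}(\gamma) \le x_T - \varepsilon \right) - \mathrm{Pr}\left( T^{\mathrm{appr}}_{b_T}(\gamma) \le x_T \right) \right] \le 0. \tag{E.63}$$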

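Equations (E.62) and (E.63) imply that for any real sequence $x_T$,

$$\sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,\mathrm{sub}}\left( \hat{L}_{T,b_T}(x_T; \gamma) > \eta_2 + \mathrm{Pr}\left( T^{\mathrm{appr}}_{b_T}(\gamma) \le x_T + \eta_2 \right) \right) \to 0. \tag{E.64}$$

Plug $x_T = q^{\mathrm{appr}}_{b_T}(\gamma, p) - 2\eta_2$ into the above equation and we have:

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}^*_{F,\mathrm{sub}}\left( \hat{L}_{T,b_T}\left( q^{\mathrm{appr}}_{b_T}(\gamma, p) - 2\eta_2; \gamma \right) > \eta_2 + p \right) = 0. \tag{E.65}$$

However, by the definition of $c^{\mathrm{sub}}_T(\gamma, p)$, $\hat{L}_{T,b_T}(c^{\mathrm{sub}}_T(\gamma, p) - \eta^*; \gamma) \ge p + \eta^* > \eta_2 + p$. Therefore

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}^*_{F,\mathrm{sub}}\left( \hat{L}_{T,b_T}\left( q^{\mathrm{appr}}_{b_T}(\gamma, p) - 2\eta_2; \gamma \right) \ge \hat{L}_{T,b_T}\left( c^{\mathrm{sub}}_T(\gamma, p) - \eta^*; \gamma \right) \right) = 0, \tag{E.66}$$

which implies (E.56).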
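The last implication can be spelled out in one line. The following is a sketch that uses only $\eta_2 = \eta^*/3$ and the fact that the empirical distribution function in (E.58), written $\hat{L}_{T,b_T}$ here, is nondecreasing in its first argument:

```latex
\begin{aligned}
c^{\mathrm{sub}}_T(\gamma,p) \le q^{\mathrm{appr}}_{b_T}(\gamma,p) + \eta_2
&\;\Longrightarrow\;
c^{\mathrm{sub}}_T(\gamma,p) - \eta^* \le q^{\mathrm{appr}}_{b_T}(\gamma,p) + \eta_2 - 3\eta_2
 = q^{\mathrm{appr}}_{b_T}(\gamma,p) - 2\eta_2 \\
&\;\Longrightarrow\;
\hat{L}_{T,b_T}\bigl(c^{\mathrm{sub}}_T(\gamma,p) - \eta^*;\gamma\bigr)
 \le \hat{L}_{T,b_T}\bigl(q^{\mathrm{appr}}_{b_T}(\gamma,p) - 2\eta_2;\gamma\bigr).
\end{aligned}
```

Hence the event in (E.56) is contained in the event in (E.66), so its probability is bounded by the probability in (E.66).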

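(b) Let $q^{\mathrm{bt}}_{\kappa_T}(\gamma, p)$ be the $p$ quantile of $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ conditional on the original sample. Below we show that for $\eta_2 = \eta^*/3$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,W}\left( c^{\mathrm{bt}}_T(\gamma, p) < q^{\mathrm{bt}}_{\kappa_T}(\gamma, p) + \eta_2 \right) = 0, \tag{E.67}$$

where $\mathrm{Pr}_{F,W}$ signifies the fact that there are two sources of randomness in $c^{\mathrm{bt}}_T(\gamma, p)$, that from the original sampling and that from the bootstrap sampling. Once (E.67) is established, we have,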

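$$\begin{aligned}
\liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,W}\left( T_T(\gamma) \le c^{\mathrm{bt}}_T(\gamma, p) \right)
&\ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_F\left( T_T(\gamma) \le q^{\mathrm{bt}}_{\kappa_T}(\gamma, p) + \eta_2 \right) \\
&\ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}\left( T^{\mathrm{appr}}_T(\gamma) \le q^{\mathrm{bt}}_{\kappa_T}(\gamma, p) \right) \\
&\ge \liminf_{T\to\infty} \inf_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}\left( T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le q^{\mathrm{bt}}_{\kappa_T}(\gamma, p) \right) \\
&= p,
\end{aligned} \tag{E.68}$$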

where the first inequality holds by (E.67), the second inequality holds by Theorem E.1 and the third inequality holds because $T^{\mathrm{appr}}_{\kappa_T}(\gamma) \ge T^{\mathrm{appr}}_T(\gamma)$ for any $\gamma$ and $T$, which holds because $\sqrt{T} \ge \sqrt{\kappa_T}$ and $\Lambda_{\kappa_T}(\theta, \gamma) \subseteq \Lambda_T(\theta, \gamma)$ for large enough $T$ by Assumptions D.1(c) and D.7(c).

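Now we show (E.67). First, we show that the c.d.f. of $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ is close to the following empirical distribution:

$$F_{S_T}(x, \gamma) = S_T^{-1} \sum_{l=1}^{S_T} 1\left\{ T^*_{T,l}(\gamma) \le x \right\}, \tag{E.69}$$

where $T^*_{T,1}(\gamma), \ldots, T^*_{T,S_T}(\gamma)$ are the $S_T$ conditionally independent copies of the bootstrap test statistic. By the uniform Glivenko-Cantelli theorem, $F_{S_T}(x, \gamma)$ is close to the conditional c.d.f. of $T^*_T(\gamma)$: for any $\eta > 0$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,W}\left( \sup_{x\in\mathbb{R}} \left| F_{S_T}(x, \gamma) - \mathrm{Pr}_W\left( T^*_T(\gamma) \le x \right) \right| > \eta \right) = 0. \tag{E.70}$$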

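The same arguments as those for Theorem E.1 can be followed to show that $T^*_T(\gamma)$ is close in law to $T^{\mathrm{appr}}_{\kappa_T}(\gamma)$ in the following sense: for any real sequence $x_T$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_F\left( \left[ \mathrm{Pr}_W\left( T^*_T(\gamma) \le x_T - \eta_2 \right) - \mathrm{Pr}\left( T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le x_T \right) \right] \ge \eta_2 \right) = 0. \tag{E.71}$$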

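When following the arguments for Theorem E.1, we simply need to observe the resemblance between $T_T(\gamma)$ and $T^*_T(\gamma)$ in the following form:

$$T^*_T(\gamma) = \min_{\theta\in\Theta_{0,F}(\gamma)} \min_{\lambda\in\Lambda_{\kappa_T}(\theta,\gamma)} \int_{\mathcal{G}} S\left( \nu^{*+}_T(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{\kappa_T}\,\rho_F(\theta, g),\ \Sigma_n(\theta + \lambda/\sqrt{T}, g) \right) d\mu(g), \tag{E.72}$$

where

$$\nu^{*+}_T(\theta, g) = \nu^*_T(\theta, g) + \kappa_T^{1/2} n^{-1/2} \nu_T(\theta, g), \tag{E.73}$$

and use Lemma E.2 in conjunction with Assumption D.2(c) and use Lemma E.1(b) in place of E.1(a).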

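Equations (E.70) and (E.71) together imply that for any real sequence $x_T$,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,W}\left( \left[ F_{S_T}(x_T - \eta_2, \gamma) - \mathrm{Pr}\left( T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le x_T \right) \right] \ge 2\eta_2 \right) = 0. \tag{E.74}$$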
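One way to see how (E.70) and (E.71) combine is a union bound. In the sketch below, $A_T$, $B_T$ and $C_T$ are shorthand introduced here for illustration (they are not the paper's notation): $A_T = F_{S_T}(x_T - \eta_2, \gamma)$, $B_T = \mathrm{Pr}_W(T^*_T(\gamma) \le x_T - \eta_2)$ and $C_T = \mathrm{Pr}(T^{\mathrm{appr}}_{\kappa_T}(\gamma) \le x_T)$.

```latex
\begin{aligned}
A_T - C_T &= (A_T - B_T) + (B_T - C_T), \\
\{A_T - C_T \ge 2\eta_2\} &\subseteq \{A_T - B_T \ge \eta_2\} \cup \{B_T - C_T \ge \eta_2\}, \\
\mathrm{Pr}_{F,W}\left(A_T - C_T \ge 2\eta_2\right)
 &\le \mathrm{Pr}_{F,W}\Bigl(\sup_{x\in\mathbb{R}} \bigl|F_{S_T}(x,\gamma) - \mathrm{Pr}_W(T^*_T(\gamma)\le x)\bigr| > \eta_2/2\Bigr)
   + \mathrm{Pr}_F\left(B_T - C_T \ge \eta_2\right).
\end{aligned}
```

The first term on the right vanishes uniformly by (E.70) applied with $\eta = \eta_2/2$, and the second vanishes uniformly by (E.71).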

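Plug in $x_T = q^{\mathrm{appr}}_{\kappa_T}(\gamma, p) - \eta_2$ and we have

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,W}\left( F_{S_T}\left( q^{\mathrm{appr}}_{\kappa_T}(\gamma, p) - 2\eta_2, \gamma \right) \ge p + 2\eta_2 \right) = 0. \tag{E.75}$$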

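But by definition, $F_{S_T}(c^{\mathrm{bt}}_T(\gamma, p) - \eta^*, \gamma) \ge p + \eta^* > p + 2\eta_2$. Therefore,

$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in\mathcal{H}_0} \mathrm{Pr}_{F,W}\left( F_{S_T}\left( q^{\mathrm{appr}}_{\kappa_T}(\gamma, p) - 2\eta_2, \gamma \right) \ge F_{S_T}\left( c^{\mathrm{bt}}_T(\gamma, p) - \eta^*, \gamma \right) \right) = 0, \tag{E.76}$$

which implies (E.67).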

References

Andrews, D. W. K., and X. Shi (2013): “Inference Based on Conditional Moment Inequality Models,” Econometrica, 81.

Bugni, F. A., I. A. Canay, and X. Shi (2016): “Inference for Subvectors and Other Functions of Partially Identified Parameters in Moment Inequality Models,” Quantitative Economics.

Chernozhukov, V., H. Hong, and E. Tamer (2007): “Estimation and Confidence Regions for Parameter Sets in Econometric Models,” Econometrica, 75, 1243–1284.

Romano, J., and A. Shaikh (2008): “Inference for Identifiable Parameters in Partially Identified Econometric Models,” Journal of Statistical Planning and Inference, (Special Issue in Honor of T. W. Anderson, Jr. on the Occasion of his 90th Birthday), 138, 2786–2807.

Serfling, R. J. (1980a): Approximation Theorems in Mathematical Statistics. John Wiley and Sons, Inc.

van der Vaart, A., and J. Wellner (1996): Weak Convergence and Empirical Processes: with Applications to Statistics. Springer.
