
Accepted manuscript. Please cite as: Hansen, C., Kozbur, D., Instrumental variables estimation with many weak instruments using regularized JIVE. Journal of Econometrics (2014), http://dx.doi.org/10.1016/j.jeconom.2014.04.022. Received 13 December 2012; revised 12 January 2014; accepted 23 April 2014. First version: August 2012; this version: May 1, 2014.


INSTRUMENTAL VARIABLES ESTIMATION WITH MANY WEAK INSTRUMENTS USING REGULARIZED JIVE

CHRISTIAN HANSEN AND DAMIAN KOZBUR

Abstract. We consider instrumental variables regression in models where the number of available instruments may be larger than the sample size and consistent model selection in the first stage may not be possible. Such a situation may arise when there are many weak instruments. With many weak instruments, existing approaches to first-stage regularization can lead to a large bias relative to standard errors. We propose a jackknife instrumental variables estimator (JIVE) with regularization at each jackknife iteration that helps alleviate this bias. We derive the limiting behavior of a ridge-regularized JIV estimator (RJIVE), verifying that the RJIVE is consistent and asymptotically normal under conditions which allow for more instruments than observations and do not require consistent model selection. We provide simulation results that demonstrate that the proposed RJIVE performs favorably in terms of size of tests and risk properties relative to other many-weak-instrument estimation strategies in high-dimensional settings. We also apply the RJIVE to the Angrist and Krueger (1991) example, where it performs favorably relative to other many-instrument robust procedures.

Key Words: ridge regression, high-dimensional models, endogeneity

1. Introduction

Instrumental variables (IV) regression is commonly used in economic research to calculate treatment effects for endogenous regressors. While the use of instrumental variables can aid in identification of structural effects, IV estimates of structural effects are often imprecise in practice, as only variation in the endogenous variables induced by the instruments is used in estimating the treatment effect. One strategy for increasing the precision of IV estimates is to include many instruments in the hope that these will capture as much exogenous variation in the explanatory variable as possible. The use of many instruments can also be motivated by a desire to nonparametrically estimate the optimal instruments via series as in Newey (1990); see also Amemiya (1974) and Chamberlain (1987). In addition, the increasing availability of high-dimensional data makes it likely that applications where the number of potential instruments is similar to or larger than the number of observations will become more common in applied work; see, for example, the empirical application in Belloni, Chen, Chernozhukov, and Hansen (2010) (BCCH hereafter). While the improvement in efficiency available from using many instruments is appealing, it is well known that the usual GMM-type approaches to estimating structural parameters using instrumental variables, which include IV and 2SLS, may have substantial bias when the number of instruments is not small relative to the sample size; see Bekker (1994) and Newey and Smith (2004). This bias results in poor performance of the usual asymptotic approximation in finite-sample simulation experiments and theoretically leads to inconsistency of the 2SLS estimator in an asymptotic sequence where the number of instruments is smaller than but proportional to the sample size.

In this paper, we present an estimation and inference procedure that remains valid in the presence of very many instruments, allowing for more instruments than there are observations. The strategy we consider uses a jackknife to estimate first-stage predictions of the endogenous variables. The chief innovation of our procedure is the use of ridge regression at each iteration of the jackknife. The use of ridge regression regularizes the problem and allows us to avoid extreme overfitting of the first stage even when there are more instruments than observations in the sample. We provide asymptotic theory for the resulting regularized jackknife instrumental variables estimator (RJIVE), giving conditions under which the RJIVE is consistent and asymptotically normal. Importantly, the conditions we impose allow for the number of instruments, K, to be larger than the sample size, n, and do not require that the first-stage relationship between the endogenous variables and instruments be sparse. That is, we allow for very many instruments and do not impose that there is a low-dimensional set of variables among the set of instruments considered that captures the relationship between the instruments and endogenous variables. That we do not assume sparsity allows us to consider scenarios where there are many instruments each of which is only weakly related to the endogenous variables, i.e. scenarios with "many weak instruments." The presence of many weak instruments rules out consistent variable selection and first-stage estimation. Despite this, the use of the jackknife coupled with regularization allows us to sufficiently avoid overfitting while simultaneously extracting sufficient signal in the first stage to allow consistent and asymptotically normal estimation of the structural parameter of interest. In addition to providing theory for our proposal, we provide simulation evidence that suggests that the RJIVE performs well relative to other weak-instrument robust procedures and other regularized IV estimators. We also present a brief empirical example.

Our work complements existing estimation and inference strategies that are robust to many instruments. One approach that has received considerable attention is the use of "many-instrument" asymptotics, popularized by Bekker (1994). The goal of many-instrument asymptotics is to provide the behavior of estimators under approximating sequences where the number of instruments, K, is smaller than the sample size, n, but K/n → ρ where 0 ≤ ρ < 1, and to find estimators that perform well under this approximation. While the traditional 2SLS estimator is inconsistent in this environment, other IV estimators including LIML, Fuller's (1977) modification of LIML (FULL), and jackknife instrumental variables (JIV) remain consistent and asymptotically normal; see Bekker (1994), Chao and Swanson (2005), Hansen, Hausman, and Newey (2008), and Chao, Swanson, Hausman, Newey, and Woutersen (2012) (CSHNW hereafter).1 Under the many-instrument sequence, the asymptotic variance of these estimators differs from the asymptotic variance under the usual asymptotics but can be consistently estimated. In simulations, these estimators perform relatively well when the number of instruments is an appreciable fraction of the sample size, and simulation evidence suggests that inference based on these estimators and the many-instrument asymptotic distribution controls size of tests far better than inference based on the usual asymptotic approximation that takes the number of instruments to be small relative to the sample size. A drawback of the many-instrument asymptotic approach is that it requires the number of instruments to be less than the sample size. Many-instrument robust estimators also tend to perform poorly in simulations when K/n ≈ 1; see BCCH. We contribute to this literature by considering cases where K > n and regularization or instrument selection is necessary. Within this setting, we show that the RJIVE retains the desirable asymptotic features derived in CSHNW.

The RJIVE is also related to and complements many-instrument estimation strategies that make use of first-stage variable selection. There is a long history of using first-stage variable selection to estimate IV models; the idea of instrument selection goes back at least to Kloek and Mennes (1960) and Amemiya (1966). Recently, Bai and Ng (2009) considered using modern variable selection techniques for first-stage instrument selection, and BCCH provide a formal analysis of IV estimators with the first stage fit using methods for fitting high-dimensional sparse models such as LASSO, e.g. Tibshirani (1996) or Bickel, Ritov, and Tsybakov (2009), or Boosting, e.g. Bühlmann (2006). See also Gautier and Tsybakov (2011) for a different sparsity-based approach to IV estimation related to BCCH. Related ideas also appear in Bai and Ng (2010), Kapetanios and Marcellino (2010), Kapetanios, Khalaf, and Marcellino (2011), and Caner (2009). A common condition in all of these approaches is that the first stage is sparse; that is, there is a relatively small number of instruments contained within a known set of instruments that provide a good approximation to the relationship between the endogenous variables and instruments. The assumed sparsity in these approaches rules out many weak instruments. Donald and Newey (2001) consider a different style of variable selection procedure, which minimizes higher-order asymptotic MSE and relies on a priori knowledge that allows one to order the instruments in terms of instrument strength. The regularization approach we consider does not rely on variable selection, which allows us to relax the sparsity requirement, and does not require ex ante knowledge about the ordering of instruments. Of course, the added robustness against many weak instruments is not costless, and we would expect approaches based on variable selection to be more efficient than the RJIVE when sparsity provides a good approximation to the underlying data.

1 Under many-instrument asymptotics, LIML and FULL are only consistent in the absence of heteroskedasticity. CSHNW provide a JIV estimator that remains consistent under many-instrument asymptotics in the presence of heteroskedasticity.


Our paper is also related to shrinkage-based approaches to dealing with many instruments. Chamberlain and Imbens (2004) consider IV estimation with many instruments using a shrinkage estimator based on putting a Gaussian random coefficients structure over the first-stage coefficients in a homoskedastic setting, which is closely related to using ridge regression in the first stage. Okui (2010) uses ridge regression for estimating the first-stage regression in a homoskedastic framework where the instruments may be ordered in terms of relevance. Okui (2010) derives the asymptotic distribution of the resulting IV estimator and provides a method for choosing the ridge regression smoothing parameter that minimizes the higher-order asymptotic mean squared error (MSE) of the IV estimator. Perhaps the closest paper to our approach is Carrasco (2012), which is based on directly regularizing the inverse that appears in the definition of the 2SLS estimator; see also Carrasco and Tchuente Nguembu (2012). Carrasco (2012) considers three regularization schemes, including Tikhonov regularization, which corresponds to ridge regression, and shows that the regularized estimators achieve the semi-parametric efficiency bound under some conditions. The theoretical development in Carrasco (2012) allows for K ≫ n and, more generally, an infinite set of instruments for each n, but places restrictions on the covariance structure of the instruments which are not needed for our approach. Thus, the two papers are complementary.

2. A Dense High-Dimensional Instrumental Variables Model

In this section, we provide an intuitive discussion of the model we consider. The model is similar to a conventional linear instrumental variables model where interest focuses on structural parameters from a single equation. However, our setup differs from the traditional framework in two key respects. First, we do not assume that the number of instruments is smaller than the sample size and explicitly address the resulting need for regularization. Second, unlike many other models that allow for a high-dimensional instrument set, we allow the relationship between the instruments and the endogenous variables to be dense in the sense that all instruments may have non-zero coefficients but individually make only small contributions to forecasting the endogenous variable.

2.1. The Model. We consider a sequence of models which holds for all observations i = 1, ..., n and all n given by

$$y_i = X_i'\delta_0 + \varepsilon_i \qquad (2.1)$$

$$X_i = \Upsilon_i(Z_i) + U_i \qquad (2.2)$$

where yi is a scalar outcome of interest, Xi is a G-dimensional treatment variable, Zi is a K-dimensional instrument with K ≥ G, Υi(·) is a G-dimensional "optimal" instrument, and δ0 is the G-dimensional structural effect of interest. In the model, E[Uiεi] ≠ 0, which leads to endogeneity of Xi and inconsistency of the conventional regression of y on X. We assume that Υ is a signal of the same dimension as X that captures the part of X that is mean-independent of ε and related to the observed instrumental variables in Z, where we leave the dependence of Υ on Z implicit throughout the remainder of the paper for notational simplicity. Specifically, we will assume that E[Ui|Υi, Zi] = 0 and that E[εi|Υi, Zi] = 0 for all i. Further, we will assume that Var(Υi) ≠ 0. Thus, Υ is a valid instrument for X, and we refer to Υ as the optimal instrument.2 Estimation of δ0 could be achieved by a straightforward application of classical instrumental variables methods if Υ were observed.

The assumption that one knows the optimal instrument, Υ, seems unrealistic in many situations. To capture this, we assume that yi and Xi are observed but that Υi is not. Rather, we consider the case where estimation will be based on a K-dimensional instrument Zi which provides a signal about Υi. We focus on the case of an approximately linear signal, Υi ≈ Zi′Π, and note that this is a relatively weak restriction since the vector Zi could consist of a dictionary of transformations of some more elementary variables.3 E.g., we could have Zi = (p_{kK}(ζi))_{k=1}^{K} for some set of basis functions p_{kK}(·), such as orthogonal polynomials or splines, formed from a set of "fundamental" instruments ζi.

2 For example, we would define Υ = E[X|Z] in the canonical homoskedastic case, which is the optimal instrument as given in Chamberlain (1987) or Newey (1990).

3 We ignore the presence of included exogenous variables that show up in both the structural equation (2.1) and the first-stage equation (2.2). It would be straightforward to accommodate a known, fixed-dimensional vector of such variables. In this case, the variables in model (2.1)-(2.2) and the instruments Zi may be defined as residuals from projecting the observed variables onto the included exogenous variables. An interesting extension would be to consider estimation when the dimension of the included exogenous variables is large relative to the sample size, in which case regularization would be needed over the effects of these variables as well.

2.2. High-Dimensional Instruments. We are particularly interested in the case where the number of instruments in Zi, K, is large relative to the number of observations in the data, n. For example, many instrumental variables may exist because the set of available instruments itself is high-dimensional as in BCCH, or because one is interested in approximating Υ through the use of basis expansions as in Newey (1990), Hansen, Hausman, and Newey (2008), or CSHNW. With many instruments, regularization on the instrument set is desirable as it helps to avoid overfitting of the relationship between the instruments and endogenous variables. It is this overfitting that leads to inconsistency of the standard GMM estimator of δ0. Estimation procedures that remain valid under many-instrument asymptotics where K/n → ρ < 1, such as LIML or JIV, implicitly make use of regularization to avoid this overfitting. When K is larger than n, these strategies also become inconsistent, and it is clear that further dimension reduction or regularization of the instrument set is necessary for consistent estimation of δ0 unless Ui and εi are uncorrelated.

As an illustration, consider the class of estimators used in Hansen, Hausman, and Newey (2008), which includes all the so-called k-class estimators except for OLS:

$$\hat\delta = (X'PX - \alpha X'X)^{-1}(X'PY - \alpha X'Y)$$


where P = Z(Z′Z)⁻Z′, A⁻ denotes a generalized inverse of A, X is an n×G matrix formed by stacking the Xi, Z is an n×K matrix formed by stacking the Zi, Y is an n×1 vector formed by stacking the yi, and α is specified by the researcher. When K ≥ n and Z has full row rank, we have X′PX = X′X and X′PY = X′Y.4 Then, for a fixed α ≠ 1,5 δ̂ reduces to the OLS estimator defined by the regression of Y on X and is inconsistent for estimating δ0 unless E[Xiεi] = 0. The class of JIVE estimators considered in CSHNW requires (Z′Z)⁻¹ in its construction and thus relies on K ≤ n in practice.6 Further regularization, relative to what is already implicit in conventional many-instrument-robust procedures, may also be desirable in cases where K < n, as the additional regularization may produce estimators that are better behaved in finite samples, especially when K/n is close to one. Finally, note that these estimators may be written as a generic IV estimator

$$\hat\delta_{IV} = (\hat\Upsilon' X)^{-1}(\hat\Upsilon' Y) \qquad (2.3)$$

for Υ̂ = (P − αIn)X, where In is the n×n identity matrix.
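To make the K ≥ n degeneracy concrete, the following small numerical check (our sketch, not from the paper; all names are illustrative) verifies that P = Z(Z′Z)⁻Z′ is the identity matrix when Z has full row rank, so that X′PX = X′X and the k-class formula collapses to OLS:

```python
# Sketch: verify P = Z (Z'Z)^- Z' = I_n when K >= n and Z has full row rank,
# using the Moore-Penrose generalized inverse.
import numpy as np

rng = np.random.default_rng(0)
n, K = 20, 50                            # more instruments than observations
Z = rng.normal(size=(n, K))              # full row rank with probability one

P = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T    # projection with a generalized inverse
print(np.allclose(P, np.eye(n)))         # True: P acts as the identity on R^n
```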

2.3. Regularization with a Dense Signal. There are, of course, many options available for performing regularization. Perhaps the most obvious approach to regularization is to directly reduce the number of instruments, either through intuition or some more formal mechanism. Any dimension reduction mechanism that makes use of only Zi without regard to Xi will produce a set of instruments, Z̃i, that satisfies the exclusion restriction E[Z̃iεi] = 0 if the original instruments satisfied E[Ziεi] = 0. Such approaches include, for example, selecting a set of instruments at random or performing a factor decomposition of Zi and choosing the first few factors as instruments. The drawback of such approaches is that they may discard some or even all of the signal about the relationship between the instruments and endogenous variables if the correct set of factors is not chosen. Discarding this signal will result in a loss in efficiency and may result in a lack of identification if insufficient signal is maintained.

The desire to maintain the signal available in the instruments while simultaneously regularizing the estimation problem leads to the consideration of data-driven dimension reduction schemes that make use of the information in Xi as well as in the instrument set. A natural approach is to use a variable selection procedure to select a small-dimensional set of instruments to then use in a conventional IV-based estimator. Bai and Ng (2009), BCCH, and Gautier and Tsybakov (2011) provide examples of this approach. The formal validity of this approach relies on the assumption that the signal, Υ, may be well-approximated by a sparse model in Z.

4 If Z has full row rank, P is a well-defined projection matrix onto Rn, and PX = X and PY = Y immediately follow since the columns of X and Y lie in Rn (and P = In). This property is easy to see using the Moore-Penrose generalized inverse: full row rank of Z implies (Z′Z)⁻ = Z⁻Z′⁻, with Z⁻ = Z′(ZZ′)⁻¹ and Z′⁻ = (ZZ′)⁻¹Z for the Moore-Penrose generalized inverse. It then follows that P = Z(Z′Z)⁻Z′ = ZZ′(ZZ′)⁻¹(ZZ′)⁻¹ZZ′ = In.

5 LIML uses a data-dependent α which is identically equal to one when K ≥ n.

6 Formally, CSHNW require that Z′Z/n have minimum eigenvalue bounded away from 0 for n large enough.


That is, these approaches make use of a model where Υi = Z̃i′Π̃ up to a small approximation error, where Z̃i is an s-dimensional set of the "relevant" instruments whose identities are unknown and estimated from the data and which have associated coefficients Π̃. BCCH show that valid inference for δ0 may be obtained after doing first-stage variable selection if s²/n → 0, and also show that this condition can be weakened to s/n → 0 if a split-sample procedure is used.7 Intuitively, sparsity requires that the signal be concentrated among a small set of factors within the set of considered instruments.

To help differentiate the sparse and dense signal cases, consider a sequence of models with only one endogenous variable that has an exactly linear and homoskedastic first stage. That is, let Xi = Zi′Πn + Ui = ∑_{j=1}^{K} Zij πj + Ui, where E[Ui²|Zi] = σU² and K is the number of available instruments as before. Recall that the concentration parameter

$$\mu_n^2 = \frac{n \Pi_n' E[Z_i Z_i'] \Pi_n}{\sigma_U^2}$$

is a measure of the information available in the instruments that is closely related to the rate of convergence of IV estimators under the usual and many-instrument asymptotics, with a larger value of the concentration parameter corresponding to more information. The concentration parameter satisfies µn²/n ≥ C for some positive constant C under the usual asymptotic approximation, which takes the dimension of Zi to be small and fixed and the coefficients Πn ≠ 0K to be fixed, where 0K is a K×1 vector of zeros. The sparse model generalizes this by allowing K ≫ n,8 but imposes that µn²/n ≥ C and that

$$\mu_n^2 = \frac{n \Pi_n' E[Z_i Z_i'] \Pi_n}{\sigma_U^2} = \frac{n \tilde\Pi_n' E[\tilde Z_i \tilde Z_i'] \tilde\Pi_n}{\sigma_U^2} + O(s)$$

where Z̃i are the instruments with non-zero coefficients, the coefficients on Z̃i are given in Π̃n, and dim(Z̃i) = s with s = o(n). That is, the sparse model allows for very many instruments but requires that only a relatively small set of instruments, Z̃i, have non-zero coefficients. Further, the sparse model imposes that the usual asymptotics would apply if one used only the s instruments given in Z̃i.9

In this paper, we impose a different set of assumptions on the first-stage signal. For example, our conditions include cases where µn²/n ≥ C but the concentration parameter based on any s = o(n)-dimensional subset of the instruments, Z̃i with coefficients Π̃n, satisfies µ̃n²/n = (nΠ̃n′E[Z̃iZ̃i′]Π̃n/σU²)/n → 0.10

7 The formal conditions on the growth rate of instruments are slightly more stringent.

8 For example, the conditions in BCCH only require that log(K) = o(n^{1/3}).

9 BCCH also propose an inference method that remains valid when µn²/n → 0 or even µn² = 0.

10 Note that s does not denote the "true" number of non-zero coefficients, which may in general be equal to the total number of variables, K. Rather, we use it to generically denote the dimension of a subset of the instruments.


In such cases, any model selection or sparsity-based procedure which selects a small number of instruments for use in estimation will suffer a weak-instrument problem, since there is no low-dimensional set of variables among Zi that provides strong forecasts of the endogenous variable. Instead, we only require that the concentration parameter given the full set of instruments is large, in a sense made formal in Section 3.2. This case allows for scenarios where there is signal available in a combination of the full set of instruments but there is not a small set of instruments which contains the majority of the signal. We refer to this process for the first-stage signal as a "dense first-stage signal" or, more simply, as a "dense signal." It seems that such cases may occur in practice. For example, the available instrument set may be a large set of dummy variables where there is no obvious reason one would believe the signal concentrates on a few categories and there is no natural way to aggregate the instrument set. Our proposed approach offers a simple, feasible estimation and inference option in cases such as this.

As a simple, concrete example, note that the aforementioned behavior would be generated by a model with E[ZiZi′] full rank and πj = √(C/K) for all j, where K/n → ρ > 0. Considering the case with Zi i.i.d. with mean zero and variance one and with σU² = 1 for simplicity, one will have µn²/n = C but will have µ̃n²/n = sC/K → 0, where s = dim(Z̃i) is the number of instruments in the subset Z̃i, as long as the subset is "small" so that s = o(n). Finally, note that a set of selected variables will generically satisfy s/n → 0 for usual sparsity-based model selection procedures.

For further intuition about the difficulty raised by attempting to select instruments when the true model is not sparse, recall that valid instruments need to satisfy the exclusion condition

$$E[Z_{ij}\varepsilon_i] = 0.$$

The exclusion restriction may be violated for a set of instruments selected by a variable selection procedure that uses Xi, since

$$E[Z_{ij}\varepsilon_i \mid Z_j \text{ selected}] \neq 0$$

may occur when there are model selection mistakes in which instruments with population coefficient exactly equal to zero are erroneously included in the model. For such a selection mistake to occur, it must be that there is a high within-sample correlation between the instrument and the first-stage error Ui. Given the correlation between the structural error εi and Ui, these incorrectly selected instruments will then violate the exclusion restriction.

When the true first-stage relationship is sparse, in that there are only s = o(n) variables with non-zero coefficients and these variables are strongly informative so that the usual asymptotics would apply if their identities were known, consistent model selection is possible. In such cases, the proper use of methods designed for estimating sparse high-dimensional models will result in very low probability of model selection mistakes where variables with zero coefficients are erroneously included in the model. It can then be shown that the magnitude of the expectation E[Zijεi|Zj selected] will vanish rapidly enough so as not to affect estimation of the structural effect δ0; see, e.g., BCCH.


However, consistent variable selection is not feasible in a model with potentially very many instruments that are individually weak. The use of default choices of parameters involved in variable selection schemes, such as those given in BCCH, will often result in selection of no variables, since the default choices are based on the assumption that the informative variables have coefficients that strongly differ from zero. An intuitive but incorrect resolution of this tension is to be less conservative in the model selection step and allow more variables into the model than would be included by standard implementations of sparse high-dimensional model estimators. Unfortunately, with many weak instruments, the variables selected will be a random set of instruments which consists of informative instruments and the uninformative instruments that are most highly correlated with the first-stage noise within sample. The inclusion of these variables in turn implies E[Zijεi|Zj selected] ≠ 0 provided Ui and εi are correlated. We demonstrate this bias and the resulting potential for poor performance of post-model-selection inference in the simulation study reported in Section 4.

Rather than select instruments, one could also estimate the first-stage relationship using other shrinkage devices. Carrasco (2012) and Carrasco and Tchuente Nguembu (2012) consider a variety of regularization devices designed to directly regularize the inverse of the covariance matrix of the instruments, Z′Z/n, that shows up in the definition of standard IV estimators. See also Okui (2010) and Chamberlain and Imbens (2004), who take similar approaches. While these approaches are similar to the approach that we take in this paper in that we also use shrinkage rather than variable selection, they differ by placing different restrictions on the model, such as restrictions on the covariance operator of the instruments, that we do not require, and by relying on full-sample estimates. Unfortunately, the problem noted above, that first-stage estimates obtained from the full sample of data may violate the exclusion restriction when there is not a sparse first-stage model, is not reserved to strict variable selection devices but also applies to these other regularization schemes. Intuitively, the coefficients that will tend to be the least shrunk when there are very many weak instruments will be on the instruments that are most highly correlated with the first-stage noise, which will result in the same violation of the exclusion restriction discussed above. We illustrate this bias in the simulation results reported in Section 4.

To deal with the correlation between estimated instruments and first-stage errors induced by first-stage regularization, we propose using jackknife estimators Υ̂i of Υi. Let φ(Zi; X, Z) be a data-dependent rule for assigning predictions Υ̂i based on Zi. Then, by the independence of (εi, Ui) across i, Υ̂i = φ(Zi; X−i, Z−i), where X−i and Z−i represent the data for all observations but observation i, is independent of εi. That is, the exclusion restriction

$$E[\hat\Upsilon_i \varepsilon_i] = E[\varphi(Z_i; X_{-i}, Z_{-i})\varepsilon_i] = 0$$


holds by construction if the conventional mean independence assumption, E[εi|Zi] = 0, is satisfied.11 Then using

$$\hat\delta = \left( \sum_{i=1}^{n} \hat\Upsilon_i X_i' \right)^{-1} \sum_{i=1}^{n} \hat\Upsilon_i y_i$$

provides consistent estimates of δ0 as long as the regularized leave-one-out forecast gives informative predictions of Υi.
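A small Monte Carlo makes the mechanism explicit. The sketch below (ours; the design constants are hypothetical) compares the in-sample covariance between the structural error and (i) full-sample ridge fits of X on Z versus (ii) leave-one-out fits of the kind just described:

```python
# Full-sample ridge fits of X on Z are correlated with eps by construction;
# leave-one-out fits are independent of eps_i and so satisfy the exclusion
# restriction. Dense, weak first stage with K > n.
import numpy as np

rng = np.random.default_rng(1)
n, K, lam2 = 100, 150, 150.0             # lam2 = squared ridge penalty
cov_full, cov_loo = [], []
for _ in range(100):
    Z = rng.normal(size=(n, K))
    U = rng.normal(size=n)
    eps = 0.6 * U + 0.8 * rng.normal(size=n)      # corr(eps, U) = .6
    X = Z @ np.full(K, 1 / np.sqrt(K)) + U
    A, b = Z.T @ Z + lam2 * np.eye(K), Z.T @ X
    fit_full = Z @ np.linalg.solve(A, b)
    fit_loo = np.array([Z[i] @ np.linalg.solve(A - np.outer(Z[i], Z[i]),
                                               b - Z[i] * X[i])
                        for i in range(n)])
    cov_full.append(fit_full @ eps / n)
    cov_loo.append(fit_loo @ eps / n)
print(np.mean(cov_full))   # clearly positive: in-sample exclusion failure
print(np.mean(cov_loo))    # approximately zero
```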

3. Ridge-Regularized JIVE

In this section, we present the details of the proposed regularized JIVE. We choose to work with regularization via ridge regression, though we expect that the results below would remain valid for the combination of the jackknife with other regularization schemes. We focus on ridge regularization for several reasons. First, since our primary interest is in models for which sparsity in the first stage is not necessarily correct, it seems natural to impose regularization via pure shrinkage rather than a combination of selection and shrinkage, as in LASSO, or a pure-selection device. Second, the estimator of δ0 when using ridge in the first stage is available in closed form, simplifying the analysis of its theoretical properties. Third, the theory for the ridge estimator fits very nicely with the existing theory for JIVE with many instruments developed in CSHNW, which allows us to present concise theoretical results by adapting arguments from CSHNW to allow for regularization.

It is worth noting that our desire to accommodate the many-weak-instrument or dense-signal setting leads to some important technical distinctions between the approach we consider and a canonical treatment of high-dimensional estimation in a sparse setting. First, as in BCCH, we focus on estimation of and inference about a fixed-dimensional object: the coefficient on the endogenous variable. As has been demonstrated in the many-instrument literature, estimation and inference about this object may proceed without relying on first-stage consistency. Unlike BCCH, we do not establish a semi-parametric efficiency result over this finite-dimensional parameter because the possibility of a dense signal renders precise estimation of the optimal first-stage signal infeasible. More generally, we do not establish any type of oracle or efficiency results as is often done in sparse high-dimensional settings. We focus on the less ambitious goal of showing that consistent and asymptotically normal estimation of the structural parameter is feasible if a well-behaved first-stage estimate can be constructed that has non-vanishing correlation with the infeasible optimal signal, even when more variables than observations and a dense model are allowed.

11 A related approach is to maintain the validity of the exclusion restriction by sample splitting as described in BCCH, where instruments are selected with one half of the sample and the IV model is estimated using the second half. In initial simulations, we found that the RJIVE produced more efficient results.


Without leveraging strong beliefs that would lead to non-robust performance of the estimator across different coefficient structures, it appears that there will always be a loss of efficiency relative to the infeasible estimator that knows the values of the first-stage coefficients when one allows for an unknown and potentially strong deviation from sparsity. Further examination of the efficiency properties of estimators in this setting may be an interesting avenue for future research.

3.1. The Ridge-Regularized JIVE. Recall that the ridge regression coefficient estimate from the regression of the variable X onto the set of variables Z is given as

$$\hat\Pi = \arg\min_{\Pi} \|X - Z\Pi\|_{2,I_n}^2 + \|\Pi\|_{2,\Lambda}^2$$

where we let ‖W‖²₂,A = W′A′AW for any vector W and any conformable, positive definite matrix A. That is, the ridge regression chooses parameters to minimize the within-sample sum of squared residuals plus a penalty term for the weighted sum of squared regression coefficients. The penalty is designed so that the ridge regression favors models with small coefficients, which helps to avoid overfitting. Ridge regression is extremely convenient as a shrinkage device since the ridge coefficient estimates are available in closed form:

$$\hat\Pi = (Z'Z + \Lambda'\Lambda)^{-1}(Z'X). \qquad (3.4)$$

From this expression, it is apparent that the impact of the penalty term is to stabilize the inverse of the sample covariance matrix of the regressors, Z′Z. Since Z′Z is positive semi-definite, the addition of the positive definite matrix Λ′Λ guarantees that the inverse in the definition of Π̂ is always well-defined. This behavior is in contrast to the usual OLS estimator Π̂OLS = (Z′Z)⁻¹(Z′X), which is ill-defined when K > n since Z′Z is singular by construction, and which may be poorly behaved far more generally since Z′Z may be near-singular when the dimension of Z is large relative to n. Note that usual implementations of ridge regression set Λ = γ^{1/2}IK for a scalar penalty parameter γ. Our theory allows for the more general case, but we follow this implementation in the simulation and empirical examples.12
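The closed form is easy to check numerically, along with the equivalent "augmented least squares" representation of ridge that is used again below; this is our own sketch, not code from the paper:

```python
# Ridge as augmented OLS: the minimizer of ||X - Z Pi||^2 + ||Lam Pi||^2
# equals (Z'Z + Lam'Lam)^{-1} Z'X, and also solves the least-squares problem
# with Z augmented by Lam and X augmented by K zeros.
import numpy as np

rng = np.random.default_rng(2)
n, K = 50, 80                                      # K > n is allowed
Z, X = rng.normal(size=(n, K)), rng.normal(size=n)
Lam = np.sqrt(K) * np.eye(K)                       # Lambda = gamma^{1/2} I_K

Pi_closed = np.linalg.solve(Z.T @ Z + Lam.T @ Lam, Z.T @ X)
Z_aug = np.vstack([Z, Lam])
X_aug = np.concatenate([X, np.zeros(K)])
Pi_aug, *_ = np.linalg.lstsq(Z_aug, X_aug, rcond=None)
print(np.allclose(Pi_closed, Pi_aug))              # True
```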

In principle, one could use Υ̂ = ZΠ̂ for Π̂ in (3.4) in defining a regularized IV estimator, which essentially corresponds to one of the regularization strategies pursued in Carrasco (2012). Carrasco (2012) provides conditions under which this approach is consistent and asymptotically normal for estimating δ0. The drawback of this approach is that the estimated instrument Υ̂ is correlated with the structural error ε by construction in finite samples. Under the conditions of Carrasco (2012), this correlation is asymptotically negligible and thus does not adversely affect the asymptotic properties of the regularized IV estimator. However, these conditions rule out the case of a dense signal with the number of instruments greater than the sample size. In the following, we offer a complementary approach that applies in the case of a high-dimensional dense signal.

12 This choice of penalty matrix would not generally be optimal in the case where consistent model selection is possible. Feasible optimality of IV estimators with a dense first-stage signal is an interesting question, as estimation of the optimal instruments will generally not be possible in this scenario.


Define Π̂Λ−i = (Z′Z − ZiZi′ + Λ′Λ)⁻¹(Z′X − ZiXi′), which is the ridge regression coefficient from running a ridge regression of X on Z with regularization matrix Λ using all but the ith observation. The leave-one-out estimator Υ̂i of the value of the instrument for the ith individual may then be defined as Υ̂i = (Π̂Λ−i)′Zi. Using the constructed Υ̂i, we then define the ridge-regularized JIVE as

$$\hat\delta = \left( \sum_{i=1}^{n} (\hat\Pi_{-i}^{\Lambda})' Z_i X_i' \right)^{-1} \sum_{i=1}^{n} (\hat\Pi_{-i}^{\Lambda})' Z_i y_i. \qquad (3.5)$$

By using the sample excluding the ith observation, the RJIVE breaks the correlation between Υ̂i and εi in the case of a dense first-stage signal. The use of ridge in constructing this estimated instrument also regularizes the problem, allowing signal extraction from the large set of instruments while avoiding overfitting. The cost of this regularization is that some signal will be lost due to the imposed shrinkage. It seems that loss of first-stage signal will be generic in high-dimensional dense settings, but further exploration of this issue is warranted.
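For reference, here is a direct, loop-based sketch of (3.5) for a scalar treatment (our own illustration; the paper reports no code, and the data-generating constants below are hypothetical):

```python
# RJIVE, computed literally from its definition: a leave-one-out ridge first
# stage followed by IV with the constructed instrument.
import numpy as np

def rjive(y, X, Z, Lam):
    n, K = Z.shape
    A, b = Z.T @ Z + Lam.T @ Lam, Z.T @ X
    Ups = np.array([Z[i] @ np.linalg.solve(A - np.outer(Z[i], Z[i]),
                                           b - Z[i] * X[i])
                    for i in range(n)])            # leave-one-out predictions
    return (Ups @ y) / (Ups @ X)                   # equation (3.5) with G = 1

rng = np.random.default_rng(3)
n, K = 100, 150
Z = rng.normal(size=(n, K))
U = rng.normal(size=n)
eps = 0.6 * U + 0.8 * rng.normal(size=n)
X = Z @ np.full(K, 1 / np.sqrt(K)) + U             # dense first stage
y = 1.0 * X + eps                                  # true delta_0 = 1
Lam = X.std() * np.sqrt(K) * np.eye(K)             # baseline penalty, end of Sec. 3.1
print(rjive(y, X, Z, Lam))                         # close to 1 up to sampling noise
```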

It is useful to relate the RJIVE to the formulation of the JIVE from CSHNW,13 since the estimators are quite similar and exploiting this similarity allows us to simplify the proofs and technical details for the RJIVE. Because estimates from ridge regression can be calculated by performing the augmented regression of Xaug = (X′ 0_{G×K})′ on Zaug = (Z′ Λ′)′, results on recursive residuals as used in CSHNW continue to apply:14

$$(\hat\Pi_{-i}^{\Lambda})' Z_i = \left( X^{aug\prime} Z^{aug} (Z^{aug\prime} Z^{aug})^{-1} Z_i - P^{aug}_{ii} X_i \right) / (1 - P^{aug}_{ii}) = \sum_{j \neq i}^{n+K} P^{aug}_{ij} X^{aug}_j / (1 - P^{aug}_{ii})$$

where Paug := Zaug(Zaug′Zaug)⁻¹Zaug′ is an (n+K)×(n+K) projection matrix of rank K and Paug_ij denotes the (i, j) element of Paug. Note that Paug has rank K by construction since Λ′Λ is positive definite, and that the first principal n×n submatrix of Paug is PΛ = Z(Z′Z + Λ′Λ)⁻¹Z′. Due to the fact that Xaug_j = 0 for j > n, the expression for δ̂ can then be simplified to

$$\hat\delta = \hat H^{-1} \sum_{i \neq j} X_i P^{\Lambda}_{ij} (1 - P^{\Lambda}_{jj})^{-1} y_j$$

with

$$\hat H = \sum_{i \neq j} X_i P^{\Lambda}_{ij} (1 - P^{\Lambda}_{jj})^{-1} X_j'. \qquad (3.6)$$

13 CSHNW consider an alternative jackknife estimator which they call JIV2. We only consider the regularized analogue of JIV1, since its use is motivated by the dense-signal problem and it exhibits better performance in simulations.

14 A brief derivation is provided in the appendix.


Letting ξi = (1 − PΛii)⁻¹εi and substituting yi = Xi′δ0 + εi, we finally have

$$\hat\delta = \delta_0 + \hat H^{-1} \sum_{i \neq j} X_i P^{\Lambda}_{ij} \xi_j.$$

In the following section, we provide conditions under which δ̂ is consistent and asymptotically normal. Specifically, we provide conditions so that, after scaling, Ĥ⁻¹ converges in probability to a non-singular limit and a central limit theorem applies to ∑_{i≠j} XiPΛijξj. Finally, we verify that the asymptotic variance of δ̂ can be consistently estimated with

$$\hat V = \hat H^{-1} \hat\Sigma \hat H^{-1} \qquad (3.7)$$

where

$$\hat\Sigma = \sum_{i,j} \sum_{k \notin \{i,j\}} P^{\Lambda}_{ik} P^{\Lambda}_{jk} X_i X_j' \hat\xi_k^2 + \sum_{i \neq j} (P^{\Lambda}_{ij})^2 X_i \hat\xi_i X_j' \hat\xi_j$$

and ξ̂i = (1 − PΛii)⁻¹(yi − Xi′δ̂).15

A final consideration for implementing the RJIVE is selection of the penalty matrix Λ. We follow the usual approach for ridge regression by setting Λ = γ^{1/2}IK and further set γ^{1/2} = CK^{1/2}. In all of our results, we set the constant of proportionality C for each first-stage equation to the sample standard deviation of the element of Xi being considered, after partialing out any included control variables.16 This practice is closely related to the penalty used in Dicker (2012). Further exploration of feasible penalty choices, and of the optimality of estimators resulting under such choices, may be an interesting avenue for additional research.

3.2. Asymptotic Properties of RJIVE. In this section, we state the conditions under which we derive the properties of the RJIVE and state formal results. Proofs of the theorems are provided in the appendix. Throughout, we work with an asymptotic approximation that considers an array of models where the number of instruments increases with the number of observations, n. Thus, all objects are implicitly indexed by n, but for notational convenience we suppress this indexing except where it would cause confusion.

15 For large datasets, the following expressions may be faster to compute:

$$\hat H = X' P^{\Lambda} \tilde X - \sum_{i=1}^{n} X_i P^{\Lambda}_{ii} \tilde X_i'$$

$$\hat\Sigma = \sum_{i=1}^{n} \left( \bar X_i \bar X_i' - X_i P^{\Lambda}_{ii} \bar X_i' - \bar X_i P^{\Lambda}_{ii} X_i' \right) \hat\xi_i^2 + \sum_{k=1}^{K} \sum_{l=1}^{K} \left( \sum_{i=1}^{n} \bar Z_{ik} Z_{il} X_i \hat\xi_i \right) \left( \sum_{j=1}^{n} Z_{jk} \bar Z_{jl} X_j \hat\xi_j \right)'$$

where X̄ = PΛX, X̃i = Xi/(1 − PΛii), and Z̄ = Z(Z′Z + Λ′Λ)⁻¹. It can be shown that the two expressions are numerically equivalent; for the non-penalized case, this was proven explicitly in CSHNW.

16 That is, we set the constant equal to the standard deviation of the residuals obtained by regressing X on any control variables that are included in both the outcome and first-stage equations. In the simulations, the only included variable is a column of ones for the intercept, which produces a constant of proportionality equal to the unconditional sample standard deviation.


We let C be a generic constant whose value does not depend on n but may change with each use. We let a.s. denote almost surely, and a.s.n. denote a.s. for n large enough.

Assumption 1. K → ∞. Λ = Λn is a sequence of K×K positive definite penalty matrices such that, for PΛ := Z(Z′Z + Λ′Λ)⁻¹Z′, PΛii ≤ C for some C < 1 and for all i = 1, ..., n, a.s.n.

Assumption 2. The optimal instrument Υ can be written Υi = Snυi/√n for some υi and Sn, where Sn = S̃n diag(µ1n, ..., µGn), S̃n is G×G and bounded, and the smallest eigenvalue of S̃nS̃n′ is bounded away from zero. For each 1 ≤ j ≤ G, either µjn = √n or µjn/√n → 0. Also, rn := (min_{1≤j≤G} µjn)² → ∞ and √K/rn → 0. Finally, there is C > 0 such that ‖∑_{i=1}^{n} υiυi′/n‖ ≤ C, where ‖·‖ denotes the Euclidean norm, and the smallest eigenvalue satisfies λmin(∑_{i=1}^{n} υiυi′/n) ≥ 1/C a.s.n.

Assumption 3. There is a constant C such that, conditional on 𝒵 := (Υ, Z, Λ), the observations (ε1, U1), ..., (εn, Un) are independent with E[εi|𝒵] = 0 for all i, E[Ui|𝒵] = 0 for all i, supi E[εi²|𝒵] < C, and supi E[‖Ui‖²|𝒵] < C, a.s.

Assumption 1 places mild restrictions on the sequence of penalty matrices that are used to regularize the problem. The condition is akin to the usual full-rank assumption on the matrix of instruments Z but allows for more general behavior. Importantly, this condition allows for Z to be rank deficient and, as such, allows for K > n. The condition will be satisfied quite generally when the sequence of regularization matrices has minimum eigenvalue bounded away from zero and proportional to K, and it can generically be made true by imposing sufficient regularization.

Assumption 2 imposes restrictions on the strength of the first-stage signal available in the infeasible optimal instrument Υ. It is identical to Assumption 2 of CSHNW, and CSHNW provide detailed discussion of the classes of models for optimal instruments accommodated by this assumption. The υi defined in Assumption 2 is unobserved and of the same dimension as the infeasible optimal instrument for observation i, Υi. As such, υi is best regarded simply as a rescaled version of this optimal instrument. The statement of the condition is general enough to allow for strong identification of first-stage relationships, when µjn = √n for each 1 ≤ j ≤ G; weak identification of first-stage relationships, when µjn/√n → 0 for each 1 ≤ j ≤ G; and situations in which some endogenous variables, those with µjn = √n, have strong first-stage relationships and some endogenous variables, those with µjn/√n → 0, have weak first-stage relationships. The condition that rn → ∞ rules out the case of a small number of weak instruments as considered in, for example, Staiger and Stock (1997), Andrews and Stock (2007), and Stock, Wright, and Yogo (2002).

Assumption 3 places standard conditions on the error terms and formalizes the exclusion restrictions. The conditions on second moments impose bounded conditional heteroskedasticity, and we also assume independence of the errors across observations.


Next define

$$\bar\upsilon_i' := Z_i'(Z'Z + \Lambda'\Lambda)^{-1} Z'\upsilon = \sum_{j=1}^{n} P^{\Lambda}_{ij} \upsilon_j' = [P^{\Lambda}\upsilon]_i$$

which can be interpreted as the predicted value of υi after ridge regularization. In addition, define the jackknife analog for prediction:

$$\tilde\upsilon_i' := Z_i'(Z'Z + \Lambda'\Lambda - Z_i Z_i')^{-1} (Z'\upsilon - Z_i \upsilon_i').$$

Recall that the Zi are the observed, feasible instruments that are used in estimation, while υi is an unobservable related to the optimal instrument.

Assumption 4. There is a constant C > 0 such that λmin(∑_{i=1}^{n} υ̃iυi′/n) ≥ C a.s.n.

This assumption is a relevance condition stating that the ridge-predicted value of the optimal instrument is related to the unobserved optimal instrument. This condition requires that the signal remaining after regularization is a non-vanishing fraction of the signal present before regularization.17 Satisfaction of this condition requires both that there is signal in the optimal instrument, as in Assumption 2, and that the problem is not so high-dimensional that the amount of regularization required to guarantee stable behavior of the regularized estimator results in loss of all the signal available in the instruments. For example, Dicker (2012) shows that the ℓ2-risk of the optimal ridge estimator with Gaussian regressors in a dense model with K/n → ∞ is asymptotically the same as the risk of the trivial estimator that sets all coefficients to 0, and that this trivial estimator is asymptotically minimax. The ridge fit should be very similar to a constant in this case and thus carry no signal. Assumption 4 rules this and similar cases out.

In practice, the observable quantity Ĥ defined in (3.6) gives a signal about the size of ∑_{i=1}^{n} υ̃iυi′/n, and the two quantities are approximately equal in large enough samples. This gives the researcher a heuristic for assessing the validity of Assumption 4: if the eigenvalues of Ĥ are observed to be small relative to the variability of X, then Assumption 4 may be questionable.

For concreteness, we provide a simple example that satisfies Assumption 4.

Example 1. Suppose that the set of instruments is Gaussian and independent across observations and variables, so that Zi ∼ N(0, IK). The first-stage model is generated from Xi = Zi′Π + Ui with dim(Xi) = 1, Ui satisfying the conditions in Assumption 3, and Π satisfying ‖Π‖₂,IK = c0 > 0. In the context of the model, Υi = υiSn/√n with Sn = √n and υi = Zi′Π. In addition, let K/n → ρ ∈ (0,∞) and Λ = λK^{1/2}IK for every n and any fixed λ > 0. Then Assumptions 1, 2, and 4 are satisfied. A proof of the statement is provided in the appendix.

17 This condition can be weakened to allow a regularized signal that is non-zero but approaching zero, at the cost of strengthening Assumption 2. We forego this generalization because it requires introduction of more notation and makes the proofs more cumbersome.


Assumptions 1-4 are sufficient for consistency of RJIVE.

Theorem 1. Suppose that Assumptions 1-4 are satisfied. Then $r_n^{-1/2} S_n'(\hat\delta - \delta_0) \overset{p}{\to} 0$ and $(\hat\delta - \delta_0) \overset{p}{\to} 0$.

To establish asymptotic normality and provide a consistent estimator of the asymptotic variance of δ̂, we impose two additional conditions.

Assumption 5. There is a constant C > 0 such that ∑_{i=1}^{n} ‖υi‖⁴/n² → 0, ∑_{i=1}^{n} ‖υ̃i‖⁴/n² → 0, supi E[εi⁴|𝒵] < C, and supi E[‖Ui‖⁴|𝒵] ≤ C a.s.

Assumption 6. There exists C > 0 such that, a.s.n., supi ‖υi‖ ≤ C and supi ‖υ̃i‖ ≤ C.

Assumption 5 is quite standard in the literature and assumes that various fourth moments are bounded. Assumption 6 imposes that the (appropriately rescaled) optimal instrument is bounded almost surely and that the regularized predictions of the rescaled optimal instruments are bounded almost surely. The latter condition seems quite reasonable since regularized predictions are generally biased towards the sample mean. Nevertheless, it is possible to exhibit a sequence of optimal instruments and a sequence of regularized projection matrices such that the regularized predictions grow without bound. Assumption 6 rules out such sequences.

The following notation will aid the discussion and proofs of the asymptotic normality results that follow. First, define H̄n := ∑_{i=1}^{n} υ̃iυi′/n. We will show that, suitably normalized, the difference between Ĥ and H̄n vanishes in probability. In addition, as mentioned above, a central limit theorem will apply to the term ∑_{i≠j} XiPΛijξj. The asymptotic variance of this term will decompose into the sum of

$$\Omega_n = \sum_{i=1}^{n} E[\xi_i^2 \mid \mathcal{Z}] (\bar\upsilon_i - P^{\Lambda}_{ii}\upsilon_i)(\bar\upsilon_i - P^{\Lambda}_{ii}\upsilon_i)'/n, \qquad (3.8)$$

which corresponds to the usual limiting variance given a fixed number of instruments, and

$$\Psi_n = S_n^{-1} \left( \sum_{i \neq j} (P^{\Lambda}_{ij})^2 \left( E[U_i U_i' \mid \mathcal{Z}] E[\xi_j^2 \mid \mathcal{Z}] + E[U_i \xi_i \mid \mathcal{Z}] E[\xi_j U_j' \mid \mathcal{Z}] \right) \right) (S_n^{-1})', \qquad (3.9)$$

which can be thought of as a correction for the presence of an increasing number of instruments. Finally, the asymptotic variance of δ̂ will take the form

$$V_n = \bar H_n^{-1} (\Omega_n + \Psi_n) \bar H_n^{-1}.$$

Theorem 2. Suppose that Assumptions 1-5 are satisfied, σi² := E[εi²|𝒵] ≥ C > 0 a.s., and K/rn is bounded. Then Vn is nonsingular a.s.n. and

$$V_n^{-1/2} S_n' (\hat\delta - \delta_0) \overset{d}{\to} N(0, I).$$


As in CSHNW, asymptotic variance matrices may be singular when K/rn → ∞. Such singularity could arise when there are different strengths of identification for different elements of X. To accommodate the possibility of singularity, results are stated in terms of linear combinations of the RJIVE defined by a sequence of ℓ×G matrices, Ln.

Theorem 3. Suppose that Assumptions 1-5 are satisfied and K/rn → ∞. If Ln is a bounded sequence of ℓ×G matrices such that λmin(Ln Vn* Ln′) ≥ C a.s.n. for some C > 0, then for Vn* := H̄n⁻¹(rn/K)Ψn H̄n⁻¹,

$$(L_n V_n^* L_n')^{-1/2} L_n \sqrt{r_n/K} \, S_n' (\hat\delta - \delta_0) \overset{d}{\to} N(0, I_\ell).$$

Theorems 2 and 3 provide asymptotic distributions for the cases where the optimal instrument is, respectively, strong and weak relative to the number of observed instruments. These results provide an interesting comparison to the asymptotic results in BCCH. BCCH consider the case where K may be much greater than n, allowing for K = bn exp(n^{1/3}) for a decreasing sequence bn, and provide consistent and asymptotically normal estimators of δ0 under the assumption that the first-stage signal is sparse. In our paper, we relax the condition that the first stage is sparse but, to jointly satisfy Assumptions 1 and 4 in the dense case, implicitly impose stronger conditions on the rate of growth of K relative to n; see, e.g., Dicker (2012). Thus, we feel the two approaches are complements. The sparse model allows one to potentially consider far more instruments than in the case where one is unwilling to assume sparsity. On the other hand, the results in this paper suggest that one can do without the sparse-model assumption in the scenario where the number of available instruments is not very much greater than the number of observations, which seems like a relevant scenario in practice.

Our final result verifies that the variance estimator given in (3.7) is consistent.

Theorem 4. Under Assumptions 1-6, if K/rn is bounded, then

$$S_n' \hat V S_n - V_n \overset{p}{\to} 0.$$

If K/rn → ∞, then

$$r_n S_n' \hat V S_n / K - V_n^* \overset{p}{\to} 0.$$

4. Simulation Study

The results in the previous sections suggest that the RJIVE should have good estimation and inference properties provided the sample size n is large. We demonstrate the performance of our asymptotic approximation for the RJIVE and provide a comparison with several other standard estimators using a simulation study based on the simple data generating process

$$y_i = x_i \delta_0 + \varepsilon_i$$

$$x_i = Z_i'\Pi + U_i, \qquad (\varepsilon_i, U_i) \sim N\left(0, \begin{pmatrix} \sigma_\varepsilon^2 & \sigma_{\varepsilon U} \\ \sigma_{\varepsilon U} & \sigma_U^2 \end{pmatrix}\right)$$


where the treatment variable xi is scalar and the parameter of interest is δ0 = 1. We fix the sample size at n = 100 and the correlation between εi and Ui at ρ = .6. We consider various settings for the remaining parameters of the model.

First, we consider two different data-generating processes for the instruments. We are particularly motivated by the performance of estimators in the presence of many categorical variables and therefore consider a binary instrument design. In the binary instrument design, all instruments are independently drawn with Zij ∈ {0, 1} and P(Zij = 1) = .5. The second design considers Gaussian instruments that are correlated with one another. Under the Gaussian instrument design, all instruments are drawn with mean 0 and variance var(Zij) = .3. Dependence between instruments is given by corr(Zij, Zik) = .5^{|j−k|}. In each design, we set the number of instruments to K = 95 or K = 190.

We also consider two different sets of first-stage coefficients that are meant to generate dense and sparse first-stage relationships. In the dense case, the signal is determined by the coefficient vector Π = (ι.4K, 0.6K)′, where ιp is a 1×p vector of ones and 0q is a 1×q vector of zeros. In the sparse case, we set Π = (ι5, 0K−5)′, so only the first five instruments are relevant. We alter the strength of the instruments by adjusting the noise σU² in the first-stage regression. We measure instrument strength using the concentration parameter µ² = nΠ′E[ZiZi′]Π/σU². We consider a weak and a strong first-stage signal with µ² = 30 and µ² = 150, respectively. The remaining component of the covariance matrix of the errors is the variance of the structural error, which we fix at σε² = 2.

In addition to RJIVE, we consider five alternative estimators for each setting. We report the Post-LASSO estimator described in BCCH; LASSO-JIVE, which is an ad hoc modified version of Post-LASSO described below; the shrinkage estimator of Carrasco (2012); the standard JIVE without regularization; and two-stage least squares (2SLS). The Post-LASSO estimator is expected to perform well in the sparse design. In the dense design, LASSO is likely to select no instruments in many simulation replications when the penalty level is set according to the usual recommendations motivated by sparse estimation, which leaves the estimator undefined. To address this, we consider a second, Post-LASSO-like estimator, LASSO-JIVE, with a liberal penalty level18 that allows more instruments to be selected in the first-stage model. To account for the fact that many instruments may be selected with the liberal penalty level, we then apply JIVE using the selected instruments. We calculate the Tikhonov-regularized version of Carrasco's (2012) estimator.19 JIVE is valid under many instruments and is an alternative to regularization estimators when K < n.

18 The penalty level recommended for Post-LASSO in BCCH, for example, is proportional to √(n log K). In our simulations, we set the penalty to 2.2√(2n log(2K)) σU συ, where the standard deviations are set to their true values. We relax the penalty by a factor of √n, to 2.2√(2 log(2K)) σU συ, for selecting instruments for LASSO-JIVE.

19 Carrasco (2012) requires the input of a tuning (penalty) parameter. Carrasco (2012) provides an expression for the approximate mean squared error of the estimator and suggests choosing the tuning parameter to minimize this criterion. We calculate the optimal tuning parameter for one simulation run assuming the true values for σε², σεU, and Zi′Π are known. For the remaining simulations, the same tuning parameter is used.


selecting 95 instruments at random and performing JIVE with the selected instruments whenK = 190. Finally, we consider 2SLS since it is the most common IV estimator found in theliterature and provides a natural benchmark. Just as for JIVE, we randomly select a subset of95 instruments for forming the 2SLS estimator when K = 190. We set the penalty matrix inRJIVE equal to sx

√KIK where sx is the sample standard deviation of x which is recalculated

at every simulation replication and IK is the K×K identity matrix. As another data dependentmethod for choosing the penalty parameter, we consider an additional RJIVE estimator withλ chosen to optimize a first stage cross-validation criteria since cross-validation is commonlyemployed in applied prediction settings (RJIVE-CV).20
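The following is a minimal sketch of the baseline RJIVE computation itself, assuming the notation above ($\Lambda = s_x\sqrt{K} I_K$). The closed-form leave-one-out ridge fit uses the Sherman-Morrison identity applied to $Z'Z + \Lambda'\Lambda$ (the same algebra as the OLS derivation in Appendix 7.4); the function name and interface are ours and are not taken from the authors' code.

```python
import numpy as np

def rjive(y, x, Z, lam=None):
    """Ridge-regularized JIVE for a scalar endogenous regressor (a sketch)."""
    n, K = Z.shape
    if lam is None:
        lam = np.sqrt(K) * x.std()            # Lambda = s_x * sqrt(K) * I_K
    A_inv = np.linalg.inv(Z.T @ Z + lam ** 2 * np.eye(K))
    P = Z @ A_inv @ Z.T                       # P^Lambda; only a K x K inverse needed
    d = np.diag(P)
    # leave-one-out ridge fit: Z_i' Pi^Lambda_{-i} = ([Px]_i - P_ii x_i) / (1 - P_ii)
    p = (P @ x - d * x) / (1.0 - d)
    return (p @ y) / (p @ x), p               # JIV1-style ratio of sums

# usage: center the data first, as in the simulations
# delta_hat, p = rjive(y - y.mean(), x - x.mean(), Z - Z.mean(axis=0))
```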

The results are based on 1500 simulation replications for each setting described above. In each simulation, the data are centered so that each variable has in-sample mean zero before applying the various estimators. For each estimator, we calculate the median bias, median absolute deviation, and rejection rate for a 5%-level test of $H_0: \delta_0 = 1$. In many of the simulations, the Post-LASSO estimator is undefined as LASSO sets all coefficients to zero. In such cases, we record a failure to reject the null, which is a conservative alternative to applying the sup-score statistic described in BCCH. Median bias and median absolute deviation for Post-LASSO are calculated conditional on LASSO producing at least one non-zero coefficient estimate.^21 In all simulations, LASSO-JIVE selected at least one instrument, with the median number of instruments selected ranging from 48 to 70.

^21 In the binary simulation with $K = 95$, LASSO selected 0 instruments in 1437, 1359, 610, and 99 runs for the cases of dense signal with low concentration parameter, sparse signal with low concentration parameter, dense signal with high concentration parameter, and sparse signal with high concentration parameter, respectively. With Gaussian instruments and $K = 95$, LASSO selected 0 instruments in 543, 180, 0, and 0 runs. For $K = 190$ in the binary simulation, reading left to right across the table, LASSO selected 0 instruments in 1470, 1405, 833, and 139 runs. With Gaussian instruments and $K = 190$, LASSO selected 0 instruments in 869, 261, 0, and 0 runs.

The results for $K = 95$ are reported in Table 1. Panels A and B show results for a weak signal ($\mu^2 = 30$). For the weak signal, RJIVE and JIVE are the only estimators that produce reasonable rejection frequencies. They both show small median bias relative to median absolute deviation. By contrast, all other estimators appear to be dominated by bias regardless of whether the signal is dense or sparse. This performance demonstrates the robustness of JIVE and RJIVE in the presence of a relatively weak signal. RJIVE has considerably smaller absolute deviation than its unregularized counterpart, JIVE. Interestingly, the performance of RJIVE-CV, which uses cross-validation to select the penalty parameter, is similar to RJIVE based on our simple, baseline penalty choice. Panels C and D show results for a stronger signal ($\mu^2 = 150$). As expected with a strong sparse signal, Post-LASSO has approximately correct size in the sparse case and smaller median absolute deviation than RJIVE.

The results for $K = 190$ are reported in Table 2. Panels A and B again show results for a weak signal ($\mu^2 = 30$). For the weak signal, RJIVE and JIVE are again the only estimators that produce correct rejection frequencies. They both show small bias relative to median absolute deviation. As in the $K = 95$ case, we also see little difference between the RJIVE estimators based on the two different penalty choices. By contrast, all other estimators appear to be dominated by bias regardless of whether the signal is dense or sparse. Panels C and D show results for a stronger signal ($\mu^2 = 150$). Once again, Post-LASSO has approximately correct size in the sparse case and smaller median absolute deviation than RJIVE. This case most clearly shows the relative strength of Post-LASSO compared to RJIVE: Post-LASSO can effectively locate a concentrated strong signal among very many instruments and thus outperforms RJIVE. The RJIVE, on the other hand, dominates all considered procedures across the remainder of the designs; it continues to control size and has reasonable risk properties in the strong-sparse-signal case but loses efficiency relative to Post-LASSO. This loss of efficiency in the strong-sparse case appears to be the cost of the additional robustness that RJIVE enjoys relative to Post-LASSO and would likely be apparent for other sparsity-based procedures. Overall, the simulations suggest that RJIVE may usefully complement existing approaches to estimation and inference with many instruments, especially in settings where sparsity is suspect.

5. Empirical Example: Angrist and Krueger (1991)

In this section, we illustrate the use of the RJIVE by revisiting the classic example in the many-instrument literature, Angrist and Krueger (1991). Interest in this example focuses on estimating the causal effect of schooling on earnings while addressing the potential endogeneity of schooling through the use of instrumental variables. The identification strategy and data from Angrist and Krueger (1991) provide many instruments which can be used for schooling, and a substantial body of literature has arisen discussing concerns about the potential biases and inferential problems introduced by using the full set of available instruments. See, for example, Bound, Jaeger, and Baker (1995), Angrist, Imbens, and Krueger (1999), Staiger and Stock (1997), and Hansen, Hausman, and Newey (2008).

As in Angrist and Krueger (1991), we consider the model
$$\log(\text{wage}_i) = \alpha\,\text{Schooling}_i + w_i'\gamma + \varepsilon_i$$
$$\text{Schooling}_i = Z_i'\Pi_1 + w_i'\Pi_2 + u_i$$
where $\varepsilon_i$ and $u_i$ are unobservables that satisfy $E[\varepsilon_i \mid w_i, Z_i] = E[u_i \mid w_i, Z_i] = 0$, $\log(\text{wage}_i)$ is the log wage of individual $i$, $\text{Schooling}_i$ is the reported years of completed schooling of individual $i$, $w_i$ is a vector of control variables, and $Z_i$ is a vector of instrumental variables that affect education but do not directly affect the wage. The data were drawn from the 1980 U.S. Census and consist of 329,509 men born between 1930 and 1939. For $w_i$, we use a set of 510 variables consisting of a constant, 9 year-of-birth dummies, 50 state-of-birth dummies, and 450 state-of-birth × year-of-birth interactions. As instruments, we use three quarter-of-birth dummies and interactions of these quarter-of-birth dummies with the full set of state-of-birth and year-of-birth controls in $w_i$, giving a total of 1527 potential instruments. Angrist and Krueger (1991) discuss the endogeneity of schooling in the wage equation and provide an argument for the validity of $Z_i$ as instruments based on compulsory schooling laws and the shape of the life-cycle earnings profile. We refer the interested reader to Angrist and Krueger (1991) for further details. The coefficient of interest is $\alpha$, which summarizes the causal impact of education on earnings.

We report results for estimating $\alpha$ from several strategies and for three different instrument sets in Table 3. Each panel in Table 3 gives the results for a different set of instruments. For each set of instruments, we report results from the conventional 2SLS estimator in columns labeled "2SLS", the JIVE estimator in columns labeled "JIVE",^22 the Post-LASSO estimator of BCCH in columns labeled "Post-LASSO",^23 and RJIVE in columns labeled "RJIVE". For RJIVE, we set the ridge penalty matrix as $\sqrt{K}s_{x|w}I_K$ where $K$ is the number of variables in $Z_i$, $I_K$ denotes the $K \times K$ identity matrix, and $s_{x|w}$ is the sample standard deviation of the residuals from the OLS regression of schooling on the controls. We also report heteroskedasticity-consistent standard error estimates for each estimator.

^22 Specifically, we use the JIV1 estimator of Phillips and Hale (1977). See also Angrist, Imbens, and Krueger (1999) and CSHNW.

^23 We set the penalty parameter in the LASSO according to the refined option in BCCH equation (A.21) using residuals from the regression of schooling on the three main quarter-of-birth effects.
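As a rough sketch of the corresponding computation, one can handle the controls by partialling them out before applying RJIVE; we emphasize that this Frisch-Waugh treatment of $w_i$ is our own assumption for illustration (the paper does not spell out its handling of the controls), and `rjive` refers to the sketch given in Section 4.

```python
import numpy as np

def residualize(A, W):
    """Residuals from OLS of each column of A on the controls W."""
    coef, *_ = np.linalg.lstsq(W, A, rcond=None)
    return A - W @ coef

# wage, school: (n,) arrays; W: (n, 510) controls; Z: (n, 1527) instruments
# y_r, x_r, Z_r = residualize(wage, W), residualize(school, W), residualize(Z, W)
# s_xw = x_r.std()                               # std of schooling residuals
# alpha_hat, _ = rjive(y_r, x_r, Z_r, lam=np.sqrt(Z.shape[1]) * s_xw)
```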

Given knowledge of the instruments and the identification argument from Angrist and Krueger (1991), a natural set of instruments is simply the three main quarter-of-birth dummies. Panel A of Table 3 gives results from using this set of instruments. In this case, the four estimators considered give very similar results. These three instruments are relatively powerful and do not seem to result in substantial first-stage overfitting.^24 As such, the 2SLS estimator is fairly well-behaved and lines up well with the other, more robust procedures. Due to the small number and relative strength of the instruments, the ridge regularization using the suggested penalty imposes very little regularization, and the JIVE and RJIVE estimates are nearly identical. Interestingly, LASSO only selects two of the three possible instruments and produces estimates very similar to JIVE.

^24 Hansen, Hausman, and Newey (2008) give further discussion of this point.

In Panel B of Table 3, we report results using 180 instruments formed from the three quarter-of-birth main effects and their interactions with the 9 main effects for year-of-birth and 50 main effects for state-of-birth. Angrist and Krueger (1991) reported results from this set of instruments, motivating the use of the additional interactions from the standpoint of increasing efficiency. It is now generally believed that 2SLS estimates using this set of 180 instruments have a substantial bias toward OLS relative to the variability of the estimator due to overfitting of the first stage, which results in potentially misleading inference about the size of the schooling coefficient. As such, procedures that are robust to the presence of many instruments, such as JIVE or selecting instruments via LASSO, have been advocated when this instrument set is used. In our results, we do see that the 2SLS estimator shifts substantively toward the OLS estimate of .0673 when this larger set of instruments is used. On the other hand, all three of the many-instrument-robust point estimates remain near the value estimated using only the three main effects as instruments. We also see that the estimated standard error for each of the many-instrument-robust procedures is smaller than when only three instruments are used, suggesting there is additional signal available in the larger instrument set.^25

^25 In this case, LASSO selects five instruments.

The results reported in Panel C of Table 3 are based on using the full set of 1527 instruments and are the most interesting from the standpoint of the present paper. In this case, we see that both the Post-LASSO and JIVE point estimates have shifted substantively toward the OLS estimate. In contrast, the RJIVE is very stable, remaining around the value estimated by all of the procedures using only three instruments. More interesting is the fact that standard errors from both JIVE and Post-LASSO are now pronouncedly larger than the standard error from the RJIVE. The increase in standard errors for Post-LASSO is due to the fact that LASSO now selects only one variable. This reduction in the number of variables selected occurs because LASSO requires a higher level of signal from each variable before allowing it to enter as a significant predictor when a larger number of variables is available. Thus, the LASSO estimator remains reasonably stable in this example because one of the quarter-of-birth instruments has a substantial amount of predictive power, but it may be discarding useful small signals and be inefficient. On the other hand, the leave-one-out fits used as instruments in the JIVE are highly variable due to the large number of essentially uninformative variables used in the first stage, which results in a very high many-instrument-robust standard error. The RJIVE effectively regularizes these uninformative signals out of the problem while apparently capturing more of the signal than LASSO, producing an estimator that remains stable at the value obtained when only the strong instruments are used while having a smaller estimated variance.

These results demonstrate that RJIVE produces sensible and what appear to be relatively high-quality estimates in this application. As with LASSO, the RJIVE behaves stably without requiring a priori information about which instruments are relevant and produces estimates that are very similar to those obtained from other leading approaches to estimation and inference when this information is used. Relative to LASSO, the RJIVE appears to use more of the signal in this example, which may be due to the fact that the signal available in the many state-of-birth × year-of-birth × quarter-of-birth interactions is not sparse; i.e., there may be many of these terms that have small effects but provide valuable information in aggregate. These results suggest that RJIVE may provide a useful complement to currently advocated approaches to dealing with many instruments.

6. Conclusion

To improve the efficiency of classical IV techniques, researchers may want to make use of many instruments in order to obtain stronger predictions of the exogenous variation in the treatment variable of interest. However, many traditional IV techniques perform poorly when many instruments are used. RJIVE gives a feasible method for using the information present in a large number of instruments. The most important feature of RJIVE is that it remains consistent and approximately normal, allowing simple valid inference for treatment effects, even without the presence of a strong and sparse first-stage signal. The ability to perform well without requiring a sparse first stage is in contrast to high-dimensional IV estimators that rely on variable selection. The dense signal case, where all instruments potentially contribute to variation in the treatment variable, seems like an important setting in practice. The RJIVE is also conceptually straightforward and computationally simple. We show it performs well relative to other many-instrument-robust procedures through simulations, and it also seems to perform well in the classic Angrist and Krueger example. The results in this paper suggest that the RJIVE may provide researchers with a useful tool when faced with many instruments.

7. Appendix

7.1. Proofs of Main Theorems. In the following, we provide proofs of Theorems 1-4. The proofs follow closely the arguments of CSHNW but are modified to account for regularization in performing the jackknife. Auxiliary lemmas used in proving these results are collected in Appendix 7.2.

7.1.1. Proof of Theorem 1.

Proof. First we note that $S_n'(\hat\delta - \delta_0)/\sqrt{r_n} \xrightarrow{p} 0$ implies $\hat\delta \xrightarrow{p} \delta_0$. This is because
$$\|S_n'(\hat\delta - \delta_0)/\sqrt{r_n}\| \geq \sqrt{\lambda_{\min}(S_nS_n'/r_n)}\,\|\hat\delta - \delta_0\| \geq C\|\hat\delta - \delta_0\|.$$
Therefore, it suffices to prove the statement $S_n'(\hat\delta - \delta_0)/\sqrt{r_n} \xrightarrow{p} 0$. The strategy is to show this conditional on $\mathcal{Z} = (\Upsilon, Z, \Lambda)$ as defined in Assumption 3, and then apply dominated convergence.

Observe that, conditional on $\mathcal{Z}$, $\lambda_{\min}(S_n^{-1}\hat{H}S_n^{-1\prime}) = \lambda_{\min}(\sum_{i=1}^n \upsilon_i\upsilon_i'/n + o_P(1))$ by Lemma 6. Then $\lambda_{\min}(S_n^{-1}\hat{H}S_n^{-1\prime}) \geq \lambda_{\min}(\sum_{i=1}^n \upsilon_i\upsilon_i'/n) + o_P(1) \geq C + o_P(1)$, where the last inequality follows from Assumption 4 and holds a.s.n. Therefore, $(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1} = O_P(1)$.

Second, observe that conditional on $\mathcal{Z}$, $S_n^{-1}\sum_{i\neq j} X_iP^\Lambda_{ij}\xi_j/\sqrt{r_n} = O_P(1 + \sqrt{K/r_n})/\sqrt{r_n} = o_P(1)$ follows from Lemma 5 provided that $\sqrt{K}/r_n \to 0$.

Putting these together shows that, conditional on $\mathcal{Z}$,
$$r_n^{-1/2}S_n'(\hat\delta - \delta_0) = (S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j/\sqrt{r_n} = O_P(1)\,o_P(1) \xrightarrow{p} 0.$$

To this point, every statement was conditional on $\mathcal{Z}$. The unconditional statement follows by dominated convergence. Let $R_n = r_n^{-1/2}S_n'(\hat\delta - \delta_0)$. The above argument shows that, for any constant $v > 0$, a.s., $P(\|R_n\| > v \mid \mathcal{Z}) \to 0$. Then by dominated convergence, $P(\|R_n\| > v) = E[P(\|R_n\| > v \mid \mathcal{Z})] \to 0$. Since $v$ was arbitrary, it follows that $R_n \xrightarrow{p} 0$, proving the theorem. $\square$

7.1.2. Proof of Theorem 2.

Proof. Let $Y_n = S_n^{-1}\sum_{i\neq j} X_iP^\Lambda_{ij}\xi_j$ and notice that it can be decomposed as
$$Y_n = \sum_i(\upsilon_i - P^\Lambda_{ii}\upsilon_i)\xi_i/\sqrt{n} + S_n^{-1}\sum_{i\neq j}U_iP^\Lambda_{ij}\xi_j.$$
Let $\Gamma_n = \mathrm{var}(Y_n \mid \mathcal{Z})$ so that
$$\Gamma_n = \sum_{i=1}^n(\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'E[\xi_i^2\mid\mathcal{Z}]/n + S_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2\big(E[U_iU_i'\mid\mathcal{Z}]E[\xi_j^2\mid\mathcal{Z}] + E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]\big)S_n^{-1\prime}.$$

The main strategy is to prove a central limit theorem involving $Y_n$ and $\Gamma_n$. The tool that gives us the central limit result is Lemma 2. Therefore, we begin the proof by verifying the assumptions of Lemma 2. We start by showing that $\Gamma_n$ is nondegenerate and bounded.

Since $P^\Lambda_{ii} \leq C < 1$ a.s.n., $E[\xi_i^2\mid\mathcal{Z}] = (1 - P^\Lambda_{ii})^{-2}E[\varepsilon_i^2\mid\mathcal{Z}] \geq C$ a.s.n. Then, in the positive definite sense,
$$\Gamma_n \succeq \sum_{i=1}^n(\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'E[\xi_i^2\mid\mathcal{Z}]/n \succeq C\sum_{i=1}^n(\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'/n.$$
Now, let $\omega$ be a $G\times 1$ vector with norm $\|\omega\| = 1$. Then note that, by the Cauchy-Schwarz inequality and $\lambda_{\min}(\sum_i\upsilon_i\upsilon_i'/n) \geq C > 0$ from Assumption 4,
$$0 < C \leq \omega'\sum_{i=1}^n\upsilon_i\upsilon_i'\omega/n = \sum_{i=1}^n\omega'\upsilon_i(1 - P^\Lambda_{ii})^{-1}(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'\omega/n \leq \sqrt{\sum_{i=1}^n\omega'(\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'\omega/n}\,\sqrt{\sum_{i=1}^n(1 - P^\Lambda_{ii})^{-2}\omega'\upsilon_i\upsilon_i'\omega/n}.$$
Therefore
$$\sqrt{\sum_{i=1}^n\omega'(\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'\omega/n} \geq C\Big(\sqrt{\sum_{i=1}^n(1 - P^\Lambda_{ii})^{-2}\omega'\upsilon_i\upsilon_i'\omega/n}\Big)^{-1}.$$
By $P^\Lambda_{ii} \leq C < 1$, it follows that $\sum_{i=1}^n(1 - P^\Lambda_{ii})^{-2}\upsilon_i\upsilon_i'/n \preceq \sum_{i=1}^n C\upsilon_i\upsilon_i'/n$ in the positive definite sense. This fact and $\|\sum_{i=1}^n\upsilon_i\upsilon_i'/n\| \leq C$ from Assumption 2 give that
$$\Big(\sqrt{\sum_{i=1}^n(1 - P^\Lambda_{ii})^{-2}\omega'\upsilon_i\upsilon_i'\omega/n}\Big)^{-1} \geq C.$$
Therefore, $\sqrt{\sum_{i=1}^n\omega'(\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'\omega/n} \geq C > 0$, which in turn implies that $\lambda_{\min}(\Gamma_n) \geq C > 0$ a.s.n. This regularity in $\Gamma_n$ will justify using Lemma 2.

Note additionally that $\|\Gamma_n\| \leq C$ a.s.n. because $\sum_{i\neq j}(P^\Lambda_{ij})^2/K \leq 1$, $\|\sum_{i=1}^n\upsilon_i\upsilon_i'/n\| \leq C$, and $E[\xi_i^2\mid\mathcal{Z}] \leq C$. Therefore, the eigenvalues of $\Gamma_n^{-1}$ are bounded away from zero a.s.n.

In anticipation of using the Cramer-Wold device, let $\alpha$ be a nonzero $G\times 1$ vector. Let $W_{in} = (\upsilon_i - P^\Lambda_{ii}\upsilon_i)\xi_i/\sqrt{n}$, let $c_{1n} = \Gamma_n^{-1/2}\alpha$, and let $c_{2n} = \sqrt{K}\,S_n^{-1}\Gamma_n^{-1/2}\alpha$. Next, we show that all the conditions of Lemma 2 are satisfied. Condition (i) is satisfied by Assumption 1. Condition (ii) is satisfied since $D_{1,n} \preceq \Gamma_n$. Condition (iii) is satisfied by Assumption 3. Condition (iv) is satisfied by
$$\sum_{i=1}^nE[\|W_{in}\|^4\mid\mathcal{Z}] = \frac{1}{n^2}\sum_{i=1}^nE[\|(\upsilon_i - P^\Lambda_{ii}\upsilon_i)\xi_i\|^4\mid\mathcal{Z}] = \frac{1}{n^2}\sum_iE[\|\upsilon_i - P^\Lambda_{ii}\upsilon_i\|^4\mid\mathcal{Z}]\,E[\xi_i^4\mid\mathcal{Z}] \leq \frac{1}{n^2}\sum_{i=1}^nE[2^{4-1}\|\upsilon_i\|^4 + 2^{4-1}\|P^\Lambda_{ii}\upsilon_i\|^4\mid\mathcal{Z}]\cdot C \leq \frac{C}{n^2}\sum_{i=1}^nE[\|\upsilon_i\|^4\mid\mathcal{Z}] + \frac{C}{n^2}\sum_{i=1}^nE[\|\upsilon_i\|^4\mid\mathcal{Z}].$$
By Assumption 5, both terms above vanish as $n \to \infty$. Finally, condition (v) of Lemma 2 is satisfied by Assumption 1.

Note that $c_{1n} = \Gamma_n^{-1/2}\alpha$ and $c_{2n} = \sqrt{K/r_n}\,\sqrt{r_n}S_n^{-1}\Gamma_n^{-1/2}\alpha$ satisfy $\|c_{1n}\| \leq C$ and $\|c_{2n}\| \leq C$ a.s.n because of the boundedness of $\sqrt{K/r_n}$, $\sqrt{r_n}S_n^{-1}$, and $\Gamma_n^{-1}$. Also, $\Xi_n$ in Lemma 2 is given by
$$\Xi_n = c_{1n}'D_{1,n}c_{1n} + c_{2n}'D_{2,n}c_{2n} = \mathrm{var}(\alpha'\Gamma_n^{-1/2}Y_n\mid\mathcal{Z}) = \alpha'\alpha.$$
An application of Lemma 2 yields that
$$(\alpha'\alpha)^{-1/2}\alpha'\Gamma_n^{-1/2}Y_n = \Xi_n^{-1/2}\Big(\sum_{i=1}^nc_{1n}'W_{in} + c_{2n}'\sum_{i\neq j}U_iP^\Lambda_{ij}\xi_j/\sqrt{K}\Big) := \mathcal{Y}_n \xrightarrow{d} N(0,1)\ \text{a.s.}$$
Therefore, $\alpha'\Gamma_n^{-1/2}Y_n \xrightarrow{d} N(0, \alpha'\alpha)$ a.s., so by the Cramer-Wold device, $\Gamma_n^{-1/2}Y_n \xrightarrow{d} N(0, I_G)$ a.s.

Now, recall that $V_n$ is defined by $V_n = H_n^{-1}\Gamma_nH_n^{-1}$ for $H_n = \sum_{i=1}^n\upsilon_i\upsilon_i'/n$. Let $Q_n = V_n^{-1/2}H_n^{-1}\Gamma_n^{1/2}$. $Q_n$ is an orthogonal matrix since $Q_nQ_n' = V_n^{-1/2}H_n^{-1}\Gamma_n^{1/2}\Gamma_n^{1/2\prime}H_n^{-1}V_n^{-1/2\prime} = V_n^{-1/2}V_nV_n^{-1/2\prime} = I_G$. In addition, $Q_n$ depends only on $\mathcal{Z}$. Therefore,
$$V_n^{-1/2}(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}\Gamma_n^{1/2} = V_n^{-1/2}(H_n^{-1} + o_P(1))\Gamma_n^{1/2} = Q_n + o_P(1).$$
Note that because $\Gamma_n^{-1/2}Y_n \xrightarrow{d} N(0, I_G)$ a.s. and $Q_n$ is only a function of $\mathcal{Z}$, we have that $Q_n\Gamma_n^{-1/2}Y_n \xrightarrow{d} N(0, I_G)$. Then by the Slutsky lemma and $\hat\delta = \delta_0 + \hat{H}^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j$, we have
$$V_n^{-1/2}S_n'(\hat\delta - \delta_0) = V_n^{-1/2}(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j = V_n^{-1/2}(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}\Gamma_n^{1/2}\,\big(\Gamma_n^{-1/2}Y_n\big) = (Q_n + o_P(1))\big(\Gamma_n^{-1/2}Y_n + o_P(1)\big) = Q_n\Gamma_n^{-1/2}Y_n + o_P(1) \xrightarrow{d} N(0, I_G). \qquad\square$$

7.1.3. Proof of Theorem 3. The proof is similar to the proof of Theorem 2. We redefine the sequence $Y_n$ and again obtain a central limit result under the theorem's hypotheses.

Proof. Because $r_n/K \to 0$, we have that $\sqrt{r_n/K}\sum_i(\upsilon_i - P^\Lambda_{ii}\upsilon_i)\xi_i/\sqrt{n} \xrightarrow{p} 0$. Therefore, that term is negligible in the expansion
$$\sqrt{r_n/K}\,S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j = \sqrt{r_n/K}\sum_i(\upsilon_i - P^\Lambda_{ii}\upsilon_i)\xi_i/\sqrt{n} + \sqrt{r_n/K}\,S_n^{-1}\sum_{i\neq j}U_iP^\Lambda_{ij}\xi_j.$$
Thus, we consider only the second term on the right-hand side and redefine
$$Y_n := \sqrt{r_n}\,S_n^{-1}\sum_{i\neq j}U_iP^\Lambda_{ij}\xi_j/\sqrt{K}$$
for this proof. Let $\Gamma_n$ be the conditional variance,
$$\Gamma_n = \mathrm{var}(Y_n\mid\mathcal{Z}) = r_nS_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2\big(E[U_iU_i'\mid\mathcal{Z}]E[\xi_j^2\mid\mathcal{Z}] + E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]\big)S_n^{-1\prime}/K.$$
As before, $\|\Gamma_n\| \leq C$ a.s.n. Let $L_n$ be any sequence of bounded matrices with $\lambda_{\min}(L_n\Gamma_nL_n') \geq C > 0$ a.s.n., and let $Y_n^L = (L_n\Gamma_nL_n')^{-1/2}L_nY_n$. We apply Lemma 2 in a similar fashion as for proving Theorem 2. Let $\alpha$ be a nonzero vector. Consider $W_{in} = 0$, $c_{1n} = 0$, and $c_{2n}' = \alpha'(L_n\Gamma_nL_n')^{-1/2}L_n\sqrt{r_n}S_n^{-1}$. Then $\mathrm{var}(c_{2n}'\sum_{i\neq j}U_iP^\Lambda_{ij}\xi_j/\sqrt{K}\mid\mathcal{Z}) = \alpha'\alpha > 0$ and Lemma 2 implies that $\alpha'Y_n^L \xrightarrow{d} N(0, \alpha'\alpha)$ a.s. Therefore, $Y_n^L \xrightarrow{d} N(0, I_\ell)$.

Next, for $\bar{L}_n$ given in the hypothesis of the theorem, let $L_n = \bar{L}_nH_n^{-1}$. This way, $\bar{L}_nV_n^*\bar{L}_n' = L_n\Gamma_nL_n'$. Applying Lemma 6 gives $(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1} = H_n^{-1} + o_P(1)$, from which it follows that
$$(L_n\Gamma_nL_n')^{-1/2}\bar{L}_n(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1} = (L_n\Gamma_nL_n')^{-1/2}\bar{L}_n(H_n^{-1} + o_P(1)) = (L_n\Gamma_nL_n')^{-1/2}L_n + o_P(1).$$
Finally, $\sqrt{r_n/K}\,S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j = O_P(1)$ by Lemma 5, so
$$(\bar{L}_nV_n^*\bar{L}_n')^{-1/2}\bar{L}_n\sqrt{r_n/K}\,S_n'(\hat\delta - \delta_0) = (L_n\Gamma_nL_n')^{-1/2}\bar{L}_n(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}\sqrt{r_n/K}\,S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j = \big((L_n\Gamma_nL_n')^{-1/2}L_n + o_P(1)\big)\big(Y_n + o_P(1)\big) = Y_n^L + o_P(1) \xrightarrow{d} N(0, I_\ell). \qquad\square$$

7.1.4. Proof of Theorem 4. The following notation will be used in the proof of the theorem. It will also appear in the proofs of Lemmas 7 and 8 below.

Definition 1. The following definitions are convenient for Lemmas 7 and 8 as well as for proving Theorem 4:

(i) $\tilde{X}_i = S_n^{-1}X_i$,

(ii) $\hat\Sigma_1 = \sum_{i\neq j\neq k}\tilde{X}_iP^\Lambda_{ik}\hat\xi_k^2P^\Lambda_{kj}\tilde{X}_j'$,

(iii) $\Sigma_1 = \sum_{i\neq j\neq k}\tilde{X}_iP^\Lambda_{ik}\xi_k^2P^\Lambda_{kj}\tilde{X}_j'$,

(iv) $\hat\Sigma_2 = \sum_{i\neq j}(P^\Lambda_{ij})^2\big(\tilde{X}_i\tilde{X}_i'\hat\xi_j^2 + \tilde{X}_i\hat\xi_i\hat\xi_j\tilde{X}_j'\big)$,

(v) $\Sigma_2 = \sum_{i\neq j}(P^\Lambda_{ij})^2\big(\tilde{X}_i\tilde{X}_i'\xi_j^2 + \tilde{X}_i\xi_i\xi_j\tilde{X}_j'\big)$.

Proof. We begin by expressing $\Omega_n + \Psi_n$ (defined in the text in equations (3.8) and (3.9)) in terms of $\Sigma_1$ and $\Sigma_2$:
$$\begin{aligned}
\Omega_n + \Psi_n &= \sum_iE[\xi_i^2\mid\mathcal{Z}](\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'/n\\
&\quad + S_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2\big(E[U_iU_i'\mid\mathcal{Z}]E[\xi_j^2\mid\mathcal{Z}] + E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]\big)S_n^{-1\prime}\\
&= \sum_iE[\xi_i^2\mid\mathcal{Z}](\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'/n - \sum_{i\neq j}(P^\Lambda_{ij})^2z_i\upsilon_i'E[\xi_i^2\mid\mathcal{Z}] + \sum_{i\neq j}(P^\Lambda_{ij})^2z_i\upsilon_i'E[\xi_i^2\mid\mathcal{Z}]\\
&\quad + S_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2\big(E[U_iU_i'\mid\mathcal{Z}]E[\xi_j^2\mid\mathcal{Z}] + E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]\big)S_n^{-1\prime}\\
&= \sum_iE[\xi_i^2\mid\mathcal{Z}](\upsilon_i - P^\Lambda_{ii}\upsilon_i)(\upsilon_i - P^\Lambda_{ii}\upsilon_i)'/n - \sum_{i\neq j}(P^\Lambda_{ij})^2z_i\upsilon_i'E[\xi_i^2\mid\mathcal{Z}] + \Sigma_2 + o_P(K/r_n)\quad\text{(by Lemma 8)}\\
&= \sum_iE[\xi_i^2\mid\mathcal{Z}]\big(\upsilon_i\upsilon_i' - P^\Lambda_{ii}\upsilon_i\upsilon_i' - P^\Lambda_{ii}\upsilon_i\upsilon_i' + (P^\Lambda_{ii})^2\upsilon_i\upsilon_i'\big)/n - \sum_{i\neq j}(P^\Lambda_{ij})^2z_i\upsilon_i'E[\xi_i^2\mid\mathcal{Z}] + \Sigma_2 + o_P(K/r_n)\\
&= \sum_{i\neq j\neq k}\upsilon_iP^\Lambda_{ik}E[\xi_k^2\mid\mathcal{Z}]P^\Lambda_{kj}\upsilon_j'/n + \Sigma_2 + o_P(K/r_n)\\
&= \Sigma_1 + o_P(1) + \Sigma_2 + o_P(K/r_n)\quad\text{(by Lemma 8)}\\
&= \hat\Sigma_1 + o_P(1) + \hat\Sigma_2 + o_P(K/r_n)\quad\text{(by Lemma 7)}.
\end{aligned}$$

In the case that $K/r_n$ is bounded, Lemma 6 implies that
$$S_n'\hat{V}S_n = (S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}(\hat\Sigma_1 + \hat\Sigma_2)(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1} = (H_n^{-1} + o_P(1))(\Omega_n + \Psi_n + o_P(1))(H_n^{-1} + o_P(1)) = V_n + o_P(1)$$
due to $H_n^{-1}$ and $\Omega_n + \Psi_n$ being bounded a.s.n.

When $K/r_n \to \infty$, $(r_n/K)(\hat\Sigma_1 + \hat\Sigma_2) = (r_n/K)\Psi_n + (r_n/K)\Omega_n + o_P(1) = (r_n/K)\Psi_n + o_P(1)$. Since $(r_n/K)\Psi_n$ is bounded a.s.n, we have
$$(r_n/K)S_n'\hat{V}S_n = (S_n^{-1}\hat{H}S_n^{-1\prime})^{-1}(r_n/K)(\hat\Sigma_1 + \hat\Sigma_2)(S_n^{-1}\hat{H}S_n^{-1\prime})^{-1} = (H_n^{-1} + o_P(1))\big((r_n/K)\Psi_n + o_P(1)\big)(H_n^{-1} + o_P(1)) = V_n^* + o_P(1). \qquad\square$$
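For scalar $x$ (so $G = 1$) and under the simplifying assumption $S_n = \sqrt{n}$, the quantities $\hat\Sigma_1$, $\hat\Sigma_2$, and $\hat{H}$ from Definition 1 admit the vectorized forms used in the following sketch of a standard-error calculation; the algebraic rearrangements and helper names are ours, not the authors'.

```python
import numpy as np

def rjive_se(y, x, Z, delta, P):
    """Sketch: many-instrument-robust s.e. from Definition 1 with G = 1."""
    n = len(y)
    d = np.diag(P)
    xt = x / np.sqrt(n)                        # Xtilde_i = S_n^{-1} X_i
    xi = (y - x * delta) / (1.0 - d)           # xi_hat_i
    a = P @ xt - d * xt                        # a_k = sum_{i != k} Xt_i P_ik
    b = (P ** 2) @ xt ** 2 - d ** 2 * xt ** 2  # b_k = sum_{i != k} Xt_i^2 P_ik^2
    Sigma1 = np.sum(xi ** 2 * (a ** 2 - b))    # sum over i != j != k
    M = P ** 2 - np.diag(d ** 2)               # squared off-diagonal entries of P
    Sigma2 = xt ** 2 @ M @ xi ** 2 + (xt * xi) @ M @ (xt * xi)
    g = P @ (xt / (1.0 - d)) - d * xt / (1.0 - d)
    H = xt @ g                                 # sum_{i != j} Xt_i P_ij Xt_j/(1 - P_jj)
    return np.sqrt((Sigma1 + Sigma2) / H ** 2 / n)
```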

7.2. Lemmas. In this appendix, we provide several lemmas that are used in proving Theorems 1-4.

The first lemma bounds deviations of quantities of the form $\sum_{i\neq j}^nP^\Lambda_{ij}W_iY_j$ from their conditional expectations for generic random variables $W_i$ and $Y_i$, in a way made precise below.

Lemma 1. Lemma A1 of CSHNW holds with the idempotent matrix $P$ replaced by $P^\Lambda$. That is, suppose $P^\Lambda \equiv P^\Lambda(Z) = Z(Z'Z + \Lambda'\Lambda)^{-1}Z'$. Let $(W_i, Y_i)$ be generic scalar random variables. If, conditional on $\mathcal{Z}$ (defined hereafter by $\mathcal{Z} := (\Upsilon, Z, \Lambda)$), the $(W_i, Y_i)$ are independent a.s., then there is a constant $C$ such that
$$\Big\|\sum_{i\neq j}^nP^\Lambda_{ij}W_iY_j - \sum_{i\neq j}^nP^\Lambda_{ij}w_iy_j\Big\|^2_{L^2,\mathcal{Z}} \leq CB_n, \quad \text{a.s.n}$$
where $w_i = E[W_i\mid\mathcal{Z}]$, $y_i = E[Y_i\mid\mathcal{Z}]$, $B_n = K\sigma^2_{W_n}\sigma^2_{Y_n} + \sigma^2_{Y_n}w_n'w_n + \sigma^2_{W_n}y_n'y_n$, and in the definition of $B_n$, $w_n = E[(W_1, \dots, W_n)'\mid\mathcal{Z}]$, $y_n = E[(Y_1, \dots, Y_n)'\mid\mathcal{Z}]$, $\sigma_{W_n} = \max_{i\leq n}\mathrm{var}(W_i\mid\mathcal{Z})^{1/2}$, $\sigma_{Y_n} = \max_{i\leq n}\mathrm{var}(Y_i\mid\mathcal{Z})^{1/2}$. Finally, the norm in the above bound is defined by $\|\cdot\|^2_{L^2,\mathcal{Z}} = E[(\cdot)^2\mid\mathcal{Z}]$.

Proof. As in the text, define the $(n+K)\times K$ augmented data matrix $Z^{aug} := (Z'\ \ \Lambda')'$ and the $(n+K)\times(n+K)$ augmented projection matrix $P^{aug} := Z^{aug}(Z^{aug\prime}Z^{aug})^{-1}Z^{aug\prime}$. In addition, define new (degenerate) random variables $W_{n+1}, \dots, W_{n+K} = 0$ and $Y_{n+1}, \dots, Y_{n+K} = 0$. Note that because $P^\Lambda$ is identical to the principal $n\times n$ submatrix of $P^{aug}$, the following sums are equal: $\sum_{i\neq j}^nP^\Lambda_{ij}W_iY_j = \sum_{i\neq j}^{n+K}P^{aug}_{ij}W_iY_j$ and
$$\sum_{i\neq j}^nP^\Lambda_{ij}w_iy_j = E\Big[\sum_{i\neq j}^nP^\Lambda_{ij}W_iY_j\,\Big|\,\mathcal{Z}\Big] = E\Big[\sum_{i\neq j}^{n+K}P^{aug}_{ij}W_iY_j\,\Big|\,\mathcal{Z}\Big] = \sum_{i\neq j}^{n+K}P^{aug}_{ij}w_iy_j.$$
This implies that
$$\Big\|\sum_{i\neq j}^nP^\Lambda_{ij}W_iY_j - \sum_{i\neq j}^nP^\Lambda_{ij}w_iy_j\Big\|^2_{L^2,\mathcal{Z}} = \Big\|\sum_{i\neq j}^{n+K}P^{aug}_{ij}W_iY_j - \sum_{i\neq j}^{n+K}P^{aug}_{ij}w_iy_j\Big\|^2_{L^2,\mathcal{Z}} \leq CB_{n+K} = CB_n.$$
The inequality in the previous display holds by Lemma A1 of CSHNW, noting that there are now $n+K$ observations used in the application of the CSHNW lemma, since $P^{aug}$ is symmetric and idempotent of rank $K$. $CB_n = CB_{n+K}$ holds since $W_{n+1}, \dots, W_{n+K}$ and $Y_{n+1}, \dots, Y_{n+K}$ are degenerate and therefore $\sigma^2_{W_n} = \sigma^2_{W_{n+K}}$, $\sigma^2_{Y_n} = \sigma^2_{Y_{n+K}}$, $y_n'y_n = y_{n+K}'y_{n+K}$, $w_n'w_n = w_{n+K}'w_{n+K}$, and $\mathrm{rank}(P^\Lambda) = \mathrm{rank}(P^{aug}) = K$. $\square$
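The augmentation device is easy to confirm numerically: $P^\Lambda$ is exactly the principal $n\times n$ submatrix of the honest (symmetric, idempotent, rank-$K$) projection built from the augmented data. A small self-contained check, as a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, lam = 12, 5, 2.0
Z = rng.standard_normal((n, K))
Lam = lam * np.eye(K)                         # any full-rank Lambda works
P_ridge = Z @ np.linalg.inv(Z.T @ Z + Lam.T @ Lam) @ Z.T
Z_aug = np.vstack([Z, Lam])                   # (n + K) x K augmented data
P_aug = Z_aug @ np.linalg.inv(Z_aug.T @ Z_aug) @ Z_aug.T
assert np.allclose(P_ridge, P_aug[:n, :n])    # principal n x n submatrix
assert np.allclose(P_aug @ P_aug, P_aug)      # idempotent
assert np.isclose(np.trace(P_aug), K)         # rank K
```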

The second lemma is a central limit result which is useful for proving Theorems 2 and 3. As in Lemma 1, $W$ denotes a generic random vector.

Lemma 2. Lemma A2 of CSHNW holds with $P^\Lambda$ replacing $P$. That is, suppose that the following hold conditional on $\mathcal{Z}$:

(i) $P^\Lambda = P^\Lambda(Z) = Z(Z'Z + \Lambda'\Lambda)^{-1}Z'$;

(ii) $(W_{1n}, U_1, \xi_1), \dots, (W_{nn}, U_n, \xi_n)$ are independent, and $D_{1,n} := \sum_{i=1}^nE[W_{in}W_{in}'\mid\mathcal{Z}]$ satisfies $\|D_{1,n}\| \leq C$ a.s.n;

(iii) $E[W_{in}\mid\mathcal{Z}] = 0$, $E[U_i\mid\mathcal{Z}] = 0$, $E[\xi_i\mid\mathcal{Z}] = 0$, and there is a constant $C$ such that $E[\|U_i\|^4\mid\mathcal{Z}] \leq C$ and $E[\xi_i^4\mid\mathcal{Z}] \leq C$;

(iv) $\sum_{i=1}^nE[\|W_{in}\|^4\mid\mathcal{Z}] \xrightarrow{a.s.} 0$; and

(v) $K \to \infty$ as $n \to \infty$.

Then for
$$D_{2,n} := \sum_{i\neq j}(P^\Lambda_{ij})^2\big(E[U_iU_i'\mid\mathcal{Z}]E[\xi_j^2\mid\mathcal{Z}] + E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]\big)/K$$
and any sequences $c_{1n}$ and $c_{2n}$, depending on $\mathcal{Z}$, of conformable vectors with $\|c_{1n}\| \leq C$, $\|c_{2n}\| \leq C$, and $\Xi_n = c_{1n}'D_{1,n}c_{1n} + c_{2n}'D_{2,n}c_{2n} \geq 1/C$ a.s.n., it follows that
$$\mathcal{Y}_n := \Xi_n^{-1/2}\Big(c_{1n}'\sum_{i=1}^nW_{in} + c_{2n}'\sum_{i\neq j}U_iP^\Lambda_{ij}\xi_j/\sqrt{K}\Big) \xrightarrow{d} N(0,1), \quad \text{a.s.}$$

Proof. In order to prove the lemma, we would like to apply Lemma A2 of CSHNW using the same augmentation argument that was used to prove Lemma 1. However, the augmentation cannot be applied immediately because the augmented projection matrix defined previously, $P^{aug}$, need not satisfy $P^{aug}_{ii} \leq C < 1$, which is a hypothesis of Lemma A2 in CSHNW. Here we note that the conditions in Lemma A2 of CSHNW read identically to the conditions in our Lemma 2 with the exception of condition (i). Instead, theirs reads: "(i) $P = P(Z)$ is a symmetric, idempotent matrix with $\mathrm{rank}(P) = K$ and $P_{ii} \leq C < 1$." We begin by showing that their condition (i) can be weakened so that the result continues to hold for general sequences of projection matrices $P$ of rank $K$ that do not satisfy $P_{ii} \leq C < 1$. Once this is accomplished, the augmentation argument will be valid.

Suppose that $(W_{1n}, U_1, \xi_1), \dots, (W_{nn}, U_n, \xi_n)$ satisfy conditions (ii)-(v) of Lemma A2 in CSHNW and that the matrices $P = P_n$ form a sequence of projection matrices of rank $K$ but do not necessarily satisfy $P_{ii} \leq C < 1$. In addition, suppose the vectors $c_{1n}$ and $c_{2n}$ satisfy the hypotheses of the lemma: $\|c_{1n}\| \leq C$, $\|c_{2n}\| \leq C$, and $\Xi_n := c_{1n}'D_{1,n}c_{1n} + c_{2n}'D_{2,n}c_{2n} \geq 1/C$, for $D_{1,n}$ and $D_{2,n}$ defined in the statement of the lemma.

We proceed by defining new sequences which are shown to converge in distribution appropriately. The new sequences will then be related to the sequence of interest, $\mathcal{Y}_n$. This will imply that $\mathcal{Y}_n$ converges appropriately. The new sequences are denoted by overhead hats and given by
$$(\hat{W}_{i2n}, \hat{U}_i, \hat\xi_i) := \begin{cases}(W_{in}, U_i, \xi_i) & \text{if } i \leq n\\ (0, 0, 0) & \text{otherwise.}\end{cases}$$
Then define new $2n\times 2n$ matrices
$$P_{2n} := \begin{pmatrix}\tfrac{1}{2}P_n & \tfrac{1}{2}P_n\\ \tfrac{1}{2}P_n & \tfrac{1}{2}P_n\end{pmatrix}$$
where $P_n$ is the projection matrix defined in the previous paragraph. The diagonal elements of the newly defined matrices satisfy $[P_{2n}]_{ii} \leq \tfrac{1}{2}$ since $[P_n]_{ii} \leq 1$. Note also that the $P_{2n}$ are symmetric and idempotent and have rank $K = K_n$. Define $K_{2n} := \mathrm{rank}(P_{2n})$.

Finally, let
$$\hat{c}_{1,2n} := (c_{1n}'\ \ 0')', \qquad \hat{c}_{2,2n} := 2(c_{2n}'\ \ 0')'.$$
For
$$\hat{D}_{2,2n} := \sum_{i\neq j}^{2n}(P_{2n})^2_{ij}\big(E[\hat{U}_i\hat{U}_i'\mid\mathcal{Z}]E[\hat\xi_j^2\mid\mathcal{Z}] + E[\hat{U}_i\hat\xi_i\mid\mathcal{Z}]E[\hat\xi_j\hat{U}_j'\mid\mathcal{Z}]\big)/K_{2n},$$
the equality $\hat{D}_{2,2n} = \tfrac{1}{4}D_{2,n}$ holds. In addition, define $\hat{D}_{1,2n} := \sum_{i=1}^{2n}E[\hat{W}_{i2n}\hat{W}_{i2n}'\mid\mathcal{Z}]$ and note that this definition implies $\hat{D}_{1,2n} = D_{1,n}$. These equalities, together with the definition $\hat\Xi_{2n} := \hat{c}_{1,2n}'\hat{D}_{1,2n}\hat{c}_{1,2n} + \hat{c}_{2,2n}'\hat{D}_{2,2n}\hat{c}_{2,2n}$, imply that $\hat\Xi_{2n} = \Xi_n$. Therefore, $\hat\Xi_{2n} \geq 1/C$ since $\Xi_n \geq 1/C$. Then Lemma A2 of CSHNW can be applied to $\hat{\mathcal{Y}}_{2n}$ (if desired, a similar construction can be done that achieves $\hat{\mathcal{Y}}_{2n+1} = \mathcal{Y}_n$):
$$\hat{\mathcal{Y}}_{2n} := \hat\Xi_{2n}^{-1/2}\Big(\hat{c}_{1,2n}'\sum_{i=1}^{2n}\hat{W}_{i2n} + \hat{c}_{2,2n}'\sum_{i\neq j}^{2n}\hat{U}_i[P_{2n}]_{ij}\hat\xi_j/\sqrt{K_{2n}}\Big) \xrightarrow{d} N(0,1), \quad \text{a.s.}$$
To relate $\hat{\mathcal{Y}}_{2n}$ back to the original sequence $\mathcal{Y}_n$, note that $\hat{c}_{1,2n}'\sum_{i=1}^{2n}\hat{W}_{i2n} = c_{1n}'\sum_{i=1}^nW_{in}$ and that $\hat{c}_{2,2n}'\sum_{i\neq j}^{2n}\hat{U}_i[P_{2n}]_{ij}\hat\xi_j/\sqrt{K_{2n}} = 2c_{2n}'\sum_{i\neq j}^nU_i[\tfrac{1}{2}P_n]_{ij}\xi_j/\sqrt{K} = c_{2n}'\sum_{i\neq j}^nU_i[P_n]_{ij}\xi_j/\sqrt{K}$. Therefore, $\hat{\mathcal{Y}}_{2n} = \mathcal{Y}_n$. Then, by the convergence of $\hat{\mathcal{Y}}_{2n}$, it follows that
$$\mathcal{Y}_n \xrightarrow{d} N(0,1) \quad \text{a.s.}$$
This shows that Lemma A2 of CSHNW still holds without requiring $P_{ii} \leq C < 1$ provided the $P$ are projection matrices. Therefore, the same augmentation argument as was given for the proof of Lemma 1 can now be used to show Lemma 2. $\square$

Lemmas 3 and 4 bound quantities by their expectations in a similar spirit to Lemma 1. They are useful for proving Lemmas 7 and 8.

Lemma 3. Lemma A3 of CSHNW holds with $P^\Lambda$ replacing $P$. That is, if conditional on $\mathcal{Z}$, $(W_i, Y_i)$, $i = 1, \dots, n$, are independent scalars, then there is $C > 0$ such that, almost surely,
$$\Big\|\sum_{i\neq j}(P^\Lambda_{ij})^2W_iY_j - E\Big[\sum_{i\neq j}(P^\Lambda_{ij})^2W_iY_j\Big]\Big\|^2_{L^2,\mathcal{Z}} \leq CB_n'$$
where $B_n' := K\big(\sigma_W^2\sigma_Y^2 + \sigma_W^2\mu_Y^2 + \mu_W^2\sigma_Y^2\big)$, $\sigma_W^2$ and $\sigma_Y^2$ use the same notation as Lemma 1, and $\mu_W^2 := \max_{i=1,\dots,n}E[W_i\mid\mathcal{Z}]^2$, $\mu_Y^2 := \max_{i=1,\dots,n}E[Y_i\mid\mathcal{Z}]^2$.

Proof. The same augmentation argument as used for Lemma 1 applies. $\square$

Lemma 4 (modified from CSHNW for RJIVE). Suppose there is a constant $C > 0$ such that, conditional on $\mathcal{Z}$, the generic random variables $(W_1, Y_1, \eta_1), \dots, (W_n, Y_n, \eta_n)$ are independent with $E[W_i\mid\mathcal{Z}] = a_i/\sqrt{n}$ and $E[Y_i\mid\mathcal{Z}] = b_i/\sqrt{n}$, with $|a_i| < C$, $|b_i| < C$, $E[\eta_i^2\mid\mathcal{Z}] < C$, $\mathrm{var}(W_i\mid\mathcal{Z}) < C/r_n$, $\mathrm{var}(Y_i\mid\mathcal{Z}) < C/r_n$, and $\max_{i=1,\dots,n}|[P^\Lambda W]_i| \leq C/\sqrt{n}$, where $W = (W_1, \dots, W_n)'$, $[P^\Lambda W]_i$ denotes the $i$th component of $P^\Lambda W$, and $r_n$ is a sequence such that $r_n \to \infty$. Then
$$A_n := E\Big[\sum_{i\neq j\neq k}W_iP^\Lambda_{ik}\eta_kP^\Lambda_{kj}Y_j\,\Big|\,\mathcal{Z}\Big] = O_P(1), \qquad \text{and} \qquad \sum_{i\neq j\neq k}W_iP^\Lambda_{ik}\eta_kP^\Lambda_{kj}Y_j - A_n \xrightarrow{p} 0.$$

Proof. As was the case in the proof of Lemma 2, we cannot directly apply the augmentation argument. The general augmentation idea will still work but requires a slight modification. Let $W_{n+1}, \dots, W_{n+K} = Y_{n+1}, \dots, Y_{n+K} = \eta_{n+1}, \dots, \eta_{n+K} = 0$. Define $Z^{aug}$ and $P^{aug}$ as above. The reason that Lemma A4 as stated in CSHNW does not immediately apply is that we do not assume (nor does it make sense for the augmented variables) that there is $\pi_n$ such that $\max_{n+1\leq i\leq n+K}|a_i - Z_i^{aug\prime}\pi_n| \xrightarrow{a.s.} 0$. Instead, we use the condition $\max_{i=1,\dots,n}|[P^\Lambda W]_i| \leq C/\sqrt{n}$. This is the only alteration needed. Let
$$A_n^{aug} = E\Big[\sum_{i\neq j\neq k}^{n+K}W_iP^{aug}_{ik}\eta_kP^{aug}_{kj}Y_j\,\Big|\,\mathcal{Z}\Big]$$
and note that $A_n = A_n^{aug}$ because of the definitions of $W_i, Y_i, \eta_i$ for $i > n$.

Following closely the notation given in CSHNW, denote conditional means by $\bar\eta_i = E[\eta_i\mid\mathcal{Z}]$, $w_i = E[W_i\mid\mathcal{Z}]$, and $y_i = E[Y_i\mid\mathcal{Z}]$, and deviations from means by $\tilde\eta_i = \eta_i - \bar\eta_i$, $\tilde{W}_i = W_i - w_i$, and $\tilde{Y}_i = Y_i - y_i$. Expanding each factor into its conditional mean and deviation gives
$$W_iP^{aug}_{ik}\eta_kP^{aug}_{kj}Y_j = w_iP^{aug}_{ik}\bar\eta_kP^{aug}_{kj}y_j + (\text{seven cross terms}),$$
and summing over $i \neq j \neq k$ yields
$$\sum_{i\neq j\neq k}^{n+K}W_iP^{aug}_{ik}\eta_kP^{aug}_{kj}Y_j = A_n^{aug} + \sum_{r=1}^7\psi_r,$$
where $\psi_1, \dots, \psi_7$ denote the sums over $i \neq j \neq k$ of the seven cross terms in which at least one of the deviations $\tilde{W}_i$, $\tilde\eta_k$, $\tilde{Y}_j$ appears; in particular,
$$\psi_6 = \sum_{i\neq j\neq k}^{n+K}w_iP^{aug}_{ik}\tilde\eta_kP^{aug}_{kj}y_j.$$

The proof of Lemma 4 is then completed after showing that $A_n^{aug} = O_P(1)$ and that $\psi_r \xrightarrow{p} 0$ for $r = 1, \dots, 7$. A careful verification of the argument in CSHNW reveals that the arguments showing that $\psi_r \xrightarrow{p} 0$ for $r \in \{1, 2, 3, 4, 5, 7\}$ and that $A_n = O_P(1)$ remain valid in our setting. Therefore, all that is left is showing that $\psi_6 \xrightarrow{p} 0$. First observe that $E[\psi_6\mid\mathcal{Z}] = 0$. Therefore, we show that $E[\psi_6^2\mid\mathcal{Z}] \to 0$, so that by the Markov inequality, $\psi_6 \xrightarrow{p} 0$.

As before, let
$$\mu_W^2 = \max_iw_i^2, \quad \mu_Y^2 = \max_iy_i^2, \quad \mu_\eta^2 = \max_i\bar\eta_i^2; \qquad \sigma_W^2 = \max_i\mathrm{var}(W_i\mid\mathcal{Z}), \quad \sigma_Y^2 = \max_i\mathrm{var}(Y_i\mid\mathcal{Z}), \quad \sigma_\eta^2 = \max_i\mathrm{var}(\eta_i\mid\mathcal{Z}).$$
Note that $\mu_W^2 \leq C/n$, $\mu_Y^2 \leq C/n$, $\mu_\eta^2 \leq C$, and $\sigma_W^2 \leq C/r_n$, $\sigma_Y^2 \leq C/r_n$, $\sigma_\eta^2 \leq C$. Also, let $\bar{w}_i = \sum_{j=1}^{n+K}P^{aug}_{ij}w_j$ and $\bar{y}_i = \sum_{j=1}^{n+K}P^{aug}_{ij}y_j$.

Then for $i \neq k$, $\sum_{j\notin\{i,k\}}^{n+K}w_iP^{aug}_{ik}P^{aug}_{kj}y_j = w_iP^{aug}_{ik}\bar{y}_k - w_iP^{aug}_{ik}P^{aug}_{ki}y_i - w_iP^{aug}_{ik}P^{aug}_{kk}y_k$. For fixed $k$, it follows that
$$\begin{aligned}
\sum_{i\neq k}^{n+K}\sum_{j\notin\{i,k\}}^{n+K}w_iP^{aug}_{ik}P^{aug}_{kj}y_j &= \sum_{i=1}^{n+K}\big(w_iP^{aug}_{ik}\bar{y}_k - w_i(P^{aug}_{ik})^2y_i - w_iP^{aug}_{ik}P^{aug}_{kk}y_k\big) - w_kP^{aug}_{kk}\bar{y}_k + 2w_k(P^{aug}_{kk})^2y_k\\
&= \bar{w}_k\bar{y}_k - \sum_{i=1}^{n+K}w_i(P^{aug}_{ik})^2y_i - \bar{w}_kP^{aug}_{kk}y_k - w_kP^{aug}_{kk}\bar{y}_k + 2w_k(P^{aug}_{kk})^2y_k.
\end{aligned}$$
Viewing the above expression as a sum of five terms and using the fact that $(A_1 + \dots + A_5)^2 \leq 5(A_1^2 + \dots + A_5^2)$ for any numbers $A_1, \dots, A_5$, the following sequence of inequalities holds:
$$\begin{aligned}
E[\psi_6^2\mid\mathcal{Z}] &= \sum_{k=1}^{n+K}E[\tilde\eta_k^2\mid\mathcal{Z}]\Big(\sum_{i\neq k}\sum_{j\notin\{i,k\}}w_iP^{aug}_{ik}P^{aug}_{kj}y_j\Big)^2\\
&\leq 5\sum_{k=1}^{n+K}E[\tilde\eta_k^2\mid\mathcal{Z}]\Big(\bar{w}_k^2\bar{y}_k^2 + \sum_{i,j}(P^{aug}_{kj})^2(P^{aug}_{ki})^2w_iy_iw_jy_j + \bar{w}_k^2(P^{aug}_{kk})^2y_k^2 + w_k^2(P^{aug}_{kk})^2\bar{y}_k^2 + w_k^2(P^{aug}_{kk})^4y_k^2\Big)\\
&\leq 5\sigma_\eta^2\Big(\sum_k\bar{w}_k^2\bar{y}_k^2 + \mu_W^2\mu_Y^2\sum_{i,j,k}(P^{aug}_{kj})^2(P^{aug}_{ki})^2 + \mu_Y^2\sum_k\bar{w}_k^2 + \mu_W^2\sum_k\bar{y}_k^2 + 4n\mu_W^2\mu_Y^2\Big)\\
&\leq 5\sigma_\eta^2\Big(\sum_k\bar{w}_k^2\bar{y}_k^2 + 7n\mu_W^2\mu_Y^2\Big) \leq C\sum_k\bar{w}_k^2\bar{y}_k^2 + Cn/n^2 \leq C\big(\max_i|\bar{w}_i|\big)^2\sum_k\bar{y}_k^2 + o(1) = o(1)\sum_k\bar{y}_k^2 + o(1) \to 0,
\end{aligned}$$
where the final step uses $\max_i|\bar{w}_i| \leq C/\sqrt{n}$ and $\sum_k\bar{y}_k^2 \leq C$. This completes the proof of Lemma 4. $\square$

Lemmas 5 and 6 are used directly in the proof of Theorem 1 to describe the convergence of the component terms $\hat{H}$ and $\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j$ appearing in the estimator $\hat\delta$. They rely on the result in Lemma 1.

Lemma 5. If Assumptions 1-3 are satisfied, then

(i) $S_n^{-1}\hat{H}S_n^{-1\prime} = \sum_{i\neq j}\upsilon_iP^\Lambda_{ij}(1 - P^\Lambda_{jj})^{-1}\upsilon_j'/n + o_P(1)$;

(ii) $S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}(1 - P^\Lambda_{jj})^{-1}\varepsilon_j = O_P(1 + \sqrt{K/r_n})$.

Proof. The proof is similar to the proof of Lemma A5 in CSHNW with our Lemma 1 replacing their Lemma A1. It is included for convenience. The strategy will be to apply Lemma 1 repeatedly to different components of the quantities of interest to obtain the desired bounds. Let $e_k$ be the $k$-th unit vector and apply Lemma 1 with $Y_i = e_k'S_n^{-1}X_i = \upsilon_{ik}/\sqrt{n} + e_k'S_n^{-1}U_i$ and $W_i = e_l'S_n^{-1}X_i(1 - P^\Lambda_{ii})^{-1}$ for some $k$ and $l$ between 1 and $G$. By Assumption 2, $\lambda_{\min}(S_n) \geq C\sqrt{r_n}$, which implies that $\|S_n^{-1}\| \leq C/\sqrt{r_n}$. Therefore, a.s., all of the following hold:
$$y_i := E[Y_i\mid\mathcal{Z}] = \upsilon_{ik}/\sqrt{n}, \quad \mathrm{var}(Y_i\mid\mathcal{Z}) \leq C/r_n;$$
$$w_i := E[W_i\mid\mathcal{Z}] = \upsilon_{il}(1 - P^\Lambda_{ii})^{-1}/\sqrt{n}, \quad \mathrm{var}(W_i\mid\mathcal{Z}) \leq C/r_n.$$
Then, for $\sigma_{Y_n}$, $\sigma_{W_n}$, $y$, and $w$ defined in Lemma 1, it follows that, a.s.,
$$\sqrt{K}\sigma_{W_n}\sigma_{Y_n} \leq C\sqrt{K}/r_n \to 0,$$
$$\sigma_{W_n}\sqrt{y'y} \leq Cr_n^{-1/2}\sqrt{\sum_{i=1}^n\upsilon_{ik}^2/n} \to 0,$$
$$\sigma_{Y_n}\sqrt{w'w} \leq Cr_n^{-1/2}\sqrt{\sum_{i=1}^n\upsilon_{il}^2(1 - P^\Lambda_{ii})^{-2}/n} \leq Cr_n^{-1/2}\big(1 - \max_iP^\Lambda_{ii}\big)^{-1}\sqrt{\sum_{i=1}^n\upsilon_{il}^2/n} \to 0,$$
where the last convergence follows from $\max_iP^\Lambda_{ii} \leq C < 1$ a.s.n. by Assumption 1. Then $\sum_{i\neq j}Y_iP^\Lambda_{ij}W_j = e_k'S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}(1 - P^\Lambda_{jj})^{-1}X_j'S_n^{-1\prime}e_l = e_k'S_n^{-1}\hat{H}S_n^{-1\prime}e_l$ and $P^\Lambda_{ij}y_iw_j = P^\Lambda_{ij}\upsilon_{ik}\upsilon_{jl}(1 - P^\Lambda_{jj})^{-1}/n$, and so Lemma 1 is applied to show that
$$E\Big[\Big(e_k'S_n^{-1}\hat{H}S_n^{-1\prime}e_l - \sum_{i\neq j}e_k'\upsilon_iP^\Lambda_{ij}(1 - P^\Lambda_{jj})^{-1}\upsilon_j'e_l/n\Big)^2\,\Big|\,\mathcal{Z}\Big] \to 0.$$
Next consider the event $A_{n,v} := \{|e_k'S_n^{-1}\hat{H}S_n^{-1\prime}e_l - \sum_{i\neq j}e_k'\upsilon_iP^\Lambda_{ij}(1 - P^\Lambda_{jj})^{-1}\upsilon_j'e_l/n| > v\}$. By the conditional Markov inequality, for any $v > 0$,
$$P(A_{n,v}\mid\mathcal{Z}) \xrightarrow{a.s.} 0.$$
Dominated convergence then gives $P(A_{n,v}) = E[P(A_{n,v}\mid\mathcal{Z})] \to 0$. This can be repeated for all $k, l \leq G$, giving the convergence of all components of $S_n^{-1}\hat{H}S_n^{-1\prime}$ to complete the proof of (i).

To show (ii), apply Lemma 1 with $Y_i = e_k'S_n^{-1}X_i$ and $W_i = \xi_i$. Note that $w_i = 0$ and $\sigma_{W_n} \leq C$. For fixed $k$, Lemma 1 gives
$$E\Big[\Big(e_k'S_n^{-1}\sum_{i\neq j}X_iP^\Lambda_{ij}\xi_j\Big)^2\,\Big|\,\mathcal{Z}\Big] \leq CK/r_n + C.$$
The conclusion follows by the same argument as for (i). $\square$

Lemma 6. If Assumptions 1-4 are satisfied, then $S_n^{-1}\hat{H}S_n^{-1\prime} = \sum_{i=1}^n\upsilon_i\upsilon_i'/n + o_P(1)$.

Proof. It is immediate from the definition of $\upsilon_i$ that $\sum_{i=1}^n\upsilon_i\upsilon_i'/n = \sum_{i\neq j}\upsilon_iP^\Lambda_{ij}(1 - P^\Lambda_{jj})^{-1}\upsilon_j'/n$. The result is then immediate from Lemma 5. $\square$


Lemmas 7 and 8 are used in the proof of Theorem 4. They relate sample analogues of the quantities $\Omega_n$ and $\Psi_n$ to their expectations. Note that the objects used in Lemmas 7 and 8 are defined in the proof of Theorem 4.

Lemma 7. Under Assumptions 1-6, $\hat\Sigma_1 - \Sigma_1 = o_P(1)$ and $\hat\Sigma_2 - \Sigma_2 = o_P(K/r_n)$.

Proof. The proof of Lemma 7 is similar to the proof of Lemma A7 given in CSHNW.

Let $X_i^{P^\Lambda} = X_i/(1 - P^\Lambda_{ii})$. Then $\hat\xi_i^2 - \xi_i^2 = -2\xi_iX_i^{P^\Lambda\prime}(\hat\delta - \delta_0) + [X_i^{P^\Lambda\prime}(\hat\delta - \delta_0)]^2$. Let $\eta_i$ be any component of $-2\xi_iX_i^{P^\Lambda\prime}$ or $X_i^{P^\Lambda}X_i^{P^\Lambda\prime}$. Note that $S_n/\sqrt{n}$ is bounded, so that $\|\Upsilon_i\| \leq C$ for some $C$. Then $E[\eta_i^2\mid\mathcal{Z}] \leq CE[\xi_i^2\mid\mathcal{Z}] + CE[\|X_i\|^2\mid\mathcal{Z}] \leq C + C\|\Upsilon_i\|^2 + CE[\|U_i\|^2\mid\mathcal{Z}] \leq C$. By Lemma 4,
$$\sum_{i\neq j\neq k}\tilde{X}_iP^\Lambda_{ik}\eta_kP^\Lambda_{kj}\tilde{X}_j' = O_P(1).$$
Considering the expression for $\hat\xi_i^2 - \xi_i^2$, it is clear that $\hat\Sigma_1 - \Sigma_1$ is a sum of terms of the form
$$\Delta\sum_{i\neq j\neq k}\tilde{X}_iP^\Lambda_{ik}\eta_kP^\Lambda_{kj}\tilde{X}_j'$$
where $\Delta = o_P(1)$. Therefore, after applying the triangle inequality, $\hat\Sigma_1 - \Sigma_1 \xrightarrow{p} 0$.

Next consider the second conclusion, that $\hat\Sigma_2 - \Sigma_2 = o_P(K/r_n)$. Consider the random variables $\hat{A} = 1 + \|\hat\delta\|$, $\hat{B} = \|\hat\delta - \delta_0\|$, and $d_i = C + |\varepsilon_i| + \|U_i\|$ where $C$ is such that $\|\Upsilon_i\| \leq C$. Then the following inequalities all hold:
$$\|X_i\| \leq C + \|U_i\| \leq d_i, \qquad \|\tilde{X}_i\| \leq Cr_n^{-1/2}d_i,$$
$$|\hat\xi_i - \xi_i| \leq C|X_i'(\hat\delta - \delta_0)| \leq Cd_i\hat{B}, \qquad |\hat\xi_i| \leq |\xi_i| + |\hat\xi_i - \xi_i| \leq Cd_i\hat{A},$$
$$|\hat\xi_i^2 - \xi_i^2| \leq (|\hat\xi_i| + |\xi_i|)|\hat\xi_i - \xi_i| \leq Cd_i(1 + \hat{A})d_i\hat{B} \leq Cd_i^2\hat{A}\hat{B},$$
$$\|\tilde{X}_i(\hat\xi_i - \xi_i)\| \leq Cr_n^{-1/2}d_i^2\hat{B}, \qquad \|\tilde{X}_i\hat\xi_i\| \leq Cr_n^{-1/2}d_i^2\hat{A}, \qquad \|\tilde{X}_i\xi_i\| \leq Cr_n^{-1/2}d_i^2.$$
Because $E[d_i^2\mid\mathcal{Z}] \leq C$, it follows that
$$E\Big[\sum_{i\neq j}(P^\Lambda_{ij})^2d_i^2d_j^2r_n^{-1}\,\Big|\,\mathcal{Z}\Big] \leq Cr_n^{-1}\sum_{i,j}(P^\Lambda_{ij})^2 \leq Cr_n^{-1}\sum_{i=1}^nP^\Lambda_{ii} \leq CK/r_n.$$
Therefore, $\sum_{i\neq j}(P^\Lambda_{ij})^2d_i^2d_j^2r_n^{-1} = O_P(K/r_n)$.

Then one can bound $\hat\Sigma_2 - \Sigma_2 = \sum_{i\neq j}(P^\Lambda_{ij})^2\tilde{X}_i\tilde{X}_i'(\hat\xi_j^2 - \xi_j^2) + \sum_{i\neq j}(P^\Lambda_{ij})^2(\tilde{X}_i\hat\xi_i\hat\xi_j\tilde{X}_j' - \tilde{X}_i\xi_i\xi_j\tilde{X}_j')$ by considering the two terms on the right-hand side separately as follows:
$$\Big\|\sum_{i\neq j}(P^\Lambda_{ij})^2\tilde{X}_i\tilde{X}_i'(\hat\xi_j^2 - \xi_j^2)\Big\| \leq \sum_{i\neq j}(P^\Lambda_{ij})^2\|\tilde{X}_i\|^2|\hat\xi_j^2 - \xi_j^2| \leq Cr_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2d_i^2d_j^2\hat{A}\hat{B} = o_P(K/r_n)$$
and
$$\Big\|\sum_{i\neq j}(P^\Lambda_{ij})^2(\tilde{X}_i\hat\xi_i\hat\xi_j\tilde{X}_j' - \tilde{X}_i\xi_i\xi_j\tilde{X}_j')\Big\| \leq \sum_{i\neq j}(P^\Lambda_{ij})^2\big(\|\tilde{X}_i\hat\xi_i\|\,\|\tilde{X}_j(\hat\xi_j - \xi_j)\| + \|\tilde{X}_j\xi_j\|\,\|\tilde{X}_i(\hat\xi_i - \xi_i)\|\big) \leq Cr_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2d_i^2d_j^2\hat{A}\hat{B} = o_P(K/r_n).$$
The triangle inequality then implies the second statement of the lemma. $\square$

Lemma 8. Under Assumptions 1-6,
$$\Sigma_1 = \sum_{i\neq j\neq k}\upsilon_iP^\Lambda_{ik}E[\xi_k^2\mid\mathcal{Z}]P^\Lambda_{kj}\upsilon_j'/n + o_P(1)$$
and
$$\Sigma_2 = \sum_{i\neq j}(P^\Lambda_{ij})^2\upsilon_i\upsilon_i'E[\xi_j^2\mid\mathcal{Z}]/n + S_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2\big(E[U_iU_i'\mid\mathcal{Z}]E[\xi_j^2\mid\mathcal{Z}] + E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]\big)S_n^{-1\prime} + o_P(K/r_n).$$

Proof. The proof of Lemma 8 is similar to the proof given in CSHNW.

Apply Lemma 4 with $W_i$ equal to an element of $\tilde{X}_i$, $Y_j$ equal to an element of $\tilde{X}_j$, and $\eta_k = \xi_k^2$ to prove the first conclusion.

Next, use Lemma 3 to prove the second assertion by the following argument. Note that because $\mathrm{var}(\xi_i^2\mid\mathcal{Z}) \leq C$ and $r_n \leq Cn$, the following inequalities hold with $\tilde{u}_{ki}$ defined by $\tilde{u}_{ki} = e_k'S_n^{-1}U_i$:
$$E[(\tilde{X}_{ik}\tilde{X}_{il})^2\mid\mathcal{Z}] \leq CE[\tilde{X}_{ik}^4 + \tilde{X}_{il}^4\mid\mathcal{Z}] \leq C\big(\upsilon_{ik}^4/n^2 + E[\tilde{u}_{ki}^4\mid\mathcal{Z}] + \upsilon_{il}^4/n^2 + E[\tilde{u}_{li}^4\mid\mathcal{Z}]\big) \leq C/r_n^2$$
and
$$E[(\tilde{X}_{ik}\xi_i)^2\mid\mathcal{Z}] \leq CE[\upsilon_{ik}^2\xi_i^2/n + \tilde{u}_{ki}^2\xi_i^2\mid\mathcal{Z}] \leq C/n + C/r_n \leq C/r_n.$$
For $\Omega_i := E[U_iU_i'\mid\mathcal{Z}]$, we then have $E[\tilde{X}_i\tilde{X}_i'\mid\mathcal{Z}] = \upsilon_i\upsilon_i'/n + S_n^{-1}\Omega_iS_n^{-1\prime}$ and $E[\tilde{X}_i\xi_i\mid\mathcal{Z}] = S_n^{-1}E[U_i\xi_i\mid\mathcal{Z}]$. Next let $W_i$ be $\tilde{X}_{ik}\tilde{X}_{il}$ for some $k$ and $l$, so that
$$E[W_i\mid\mathcal{Z}] = e_k'S_n^{-1}\Omega_iS_n^{-1\prime}e_l + \upsilon_{ik}\upsilon_{il}/n, \qquad |E[W_i\mid\mathcal{Z}]| \leq C/r_n, \qquad E[(\tilde{X}_{ik}\tilde{X}_{il})^2\mid\mathcal{Z}] \leq C/r_n^2.$$
Finally, let $Y_i = \xi_i^2$ and note that $|E[Y_i\mid\mathcal{Z}]| \leq C$. Then by Lemma 3,
$$\Big\|\sum_{i\neq j}(P^\Lambda_{ij})^2W_iY_j - E\Big[\sum_{i\neq j}(P^\Lambda_{ij})^2W_iY_j\Big]\Big\|^2_{L^2,\mathcal{Z}} \leq CB_n'$$
where $B_n' = K(\sigma_W^2\sigma_Y^2 + \sigma_W^2\mu_Y^2 + \mu_W^2\sigma_Y^2) \leq K(C/r_n^2 + C/r_n^2 + C/r_n^2) \leq CK/r_n^2$, so the deviation is $O_P(\sqrt{K}/r_n)$. Therefore, putting in the values for $W_i$ and $Y_i$,
$$\sum_{i\neq j}(P^\Lambda_{ij})^2\tilde{X}_{ik}\tilde{X}_{il}\xi_j^2 = e_k'\sum_{i\neq j}(P^\Lambda_{ij})^2\big(\upsilon_i\upsilon_i'/n + S_n^{-1}\Omega_iS_n^{-1\prime}\big)e_lE[\xi_j^2\mid\mathcal{Z}] + O_P(\sqrt{K}/r_n).$$
This bounds the first term in the difference. Similarly, consider an application of Lemma 3 with $W_i = \tilde{X}_{ik}\xi_i$ and $Y_i = \tilde{X}_{il}\xi_i$. In this case $\sigma_W^2\sigma_Y^2 + \sigma_W^2\mu_Y^2 + \mu_W^2\sigma_Y^2 \leq C/r_n^2$. Then it follows that
$$\sum_{i\neq j}(P^\Lambda_{ij})^2\tilde{X}_{ik}\xi_i\xi_j\tilde{X}_{jl} = e_k'S_n^{-1}\sum_{i\neq j}(P^\Lambda_{ij})^2E[U_i\xi_i\mid\mathcal{Z}]E[\xi_jU_j'\mid\mathcal{Z}]S_n^{-1\prime}e_l + O_P(\sqrt{K}/r_n).$$
Finally, since $K$ grows to infinity with $n$, $O_P(\sqrt{K}/r_n) = o_P(K/r_n)$. The result follows from an application of the triangle inequality. $\square$

7.3. Verification of Example 1.

Proof. We first establish notation and then proceed to verify the assumptions in order. Denote the singular value decomposition of $Z/\sqrt{n}$ by $Z/\sqrt{n} = \mathcal{U}\mathcal{S}\mathcal{V}'$ and an eigenvalue decomposition $Z'Z/n = \mathcal{V}\mathcal{S}'\mathcal{S}\mathcal{V}'$, with $\mathcal{S}'\mathcal{S}$ being $K\times K$ diagonal and $\mathcal{V}$ distributed according to the Haar measure on the orthogonal group of order $K$ and independent of $\mathcal{S}$.

We turn to Assumption 1. First, bound $\max_iP^\Lambda_{ii}$ from above. Note that $P^\Lambda$ can be expressed in terms of the singular value decomposition as $P^\Lambda = \mathcal{U}\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'\mathcal{U}'$. A diagonal element of $P^\Lambda$ is therefore of the form $u'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'u$ where $u$ is the corresponding row of $\mathcal{U}$ with $u'u \leq 1$. It follows that $u'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'u$ is bounded by the largest eigenvalue of $\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'$, which is $s_{\max}^2/(s_{\max}^2 + \lambda K/n)$ for $s_{\max}$ the largest singular value. This shows that $\max_iP^\Lambda_{ii} \leq s_{\max}^2/(s_{\max}^2 + \lambda K/n)$.

The probability of large deviations of $s_{\max}$ can be bounded. For example, Theorem II.13 of Davidson and Szarek (2001) gives $P(s_{\max}\sqrt{n/K} > 1 + \sqrt{n/K} + t) \leq \exp(-Kt^2/2)$. Then, taking $M = (1 + \sqrt{n/K} + 1)/\sqrt{n/K}$, it follows that $P(\max_iP^\Lambda_{ii} > M^2/(M^2 + \lambda)) \leq \exp(-K/2)$. The Borel-Cantelli lemma then implies $P^\Lambda_{ii} \leq M^2/(M^2 + \lambda) < 1$ a.s.n. This verifies Assumption 1.

All that is required to verify Assumption 2 is that $C' \geq \sum_{i=1}^n\upsilon_i^2/n \geq C > 0$ a.s.n. This is clear since $\upsilon_i \sim N(0, c_0^2)$.

Finally, we need to verify Assumption 4. Independence between $Z_i$ and $\Pi^\Lambda_{-i}$ gives $E[\upsilon_i\upsilon_i] = E[(Z_i'\Pi)(Z_i'\Pi^\Lambda_{-i})] = E[Z_i'\Pi Z_i']E[\Pi^\Lambda_{-i}]$. Because $Z_i \sim N(0, I_K)$, it follows that $E[Z_i'\Pi Z_i'] = \Pi'$. Turning to the term $E[\Pi^\Lambda_{-i}]$, decompose $Z_{-i}/\sqrt{n} = \mathcal{U}_{-i}\mathcal{S}_{-i}\mathcal{V}_{-i}'$. Then $\Pi^\Lambda_{-i} = \mathcal{V}_{-i}(\mathcal{S}_{-i}'\mathcal{S}_{-i} + \lambda KI_K/n)^{-1}\mathcal{S}_{-i}'\mathcal{S}_{-i}\mathcal{V}_{-i}'\Pi$. A basic property of the Haar measure is that for any vector $\pi \in S^{K-1}_{c_0} \subset \mathbb{R}^K$ on the sphere of radius $c_0$, if $\mathcal{V} \sim \mathrm{Haar}(O_K)$ then $\mathcal{V}\pi \sim \mathrm{Unif}(S^{K-1}_{c_0})$. This, along with the fact that $\mathcal{V}$ and $\mathcal{S}$ are independent, implies that $(\mathcal{S}_{-i}'\mathcal{S}_{-i} + \lambda KI_K/n)^{-1}\mathcal{S}_{-i}'\mathcal{S}_{-i}$ is independent of $\mathcal{V}_{-i}'\Pi$. Therefore, $E[\Pi'\Pi^\Lambda_{-i}] = E[(\mathcal{V}_{-i}'\Pi)'(\mathcal{S}_{-i}'\mathcal{S}_{-i} + \lambda KI_K/n)^{-1}\mathcal{S}_{-i}'\mathcal{S}_{-i}(\mathcal{V}_{-i}'\Pi)] = \frac{\|\Pi\|^2}{K}E[\mathrm{trace}((\mathcal{S}_{-i}'\mathcal{S}_{-i} + \lambda KI_K/n)^{-1}\mathcal{S}_{-i}'\mathcal{S}_{-i})]$. Denoting the $k$-th diagonal element of $\mathcal{S}_{-i}'\mathcal{S}_{-i}$ by $s_{-i,k}^2$, $E[\Pi'\Pi^\Lambda_{-i}]$ simplifies to $\frac{c_0^2}{K}E[\sum_{k=1}^Ks_{-i,k}^2/(s_{-i,k}^2 + \lambda K/n)]$. Summarize this as
$$E[\upsilon_i\upsilon_i] = \frac{c_0^2}{K}E\Big[\sum_{k=1}^K\frac{s_{-i,k}^2}{s_{-i,k}^2 + \lambda K/n}\Big].$$

The distribution of $\{s_{-i,k}^2\}_{k=1}^K$ converges weakly to the Marchenko-Pastur law. For $K/n \to \rho < \infty$, this distribution is nondegenerate and compactly supported. Therefore, there is $C > 0$ such that, for $n$ sufficiently large, $E[\upsilon_i\upsilon_i] \geq C$ for all $i$, which shows that $E[\sum_{i=1}^n\upsilon_i\upsilon_i/n] \geq C$ for $n$ sufficiently large. We next employ a concentration inequality to show that the event $\{\sum_{i=1}^n\upsilon_i\upsilon_i/n < C/2\ \text{i.o.}\}$ has probability zero.

To simplify the notation, define $\upsilon$ by stacking the individual $\upsilon_i$, and recall that
$$\sum_{i=1}^n\upsilon_i\upsilon_i/n = \upsilon'(I_n - \mathrm{diag}(P^\Lambda))^{-1}(P^\Lambda - \mathrm{diag}(P^\Lambda))\upsilon/n.$$
We begin by showing that the elements of the diagonal $\mathrm{diag}(P^\Lambda)$ are uniformly strongly concentrated around $E[P^\Lambda_{ii}]$. This means that in the Gaussian case with regularization, all observations have approximately the same leverage with high probability. Again using the fact that $P^\Lambda = \mathcal{U}\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'\mathcal{U}'$, it follows that $P^\Lambda_{ii} = u_i'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'u_i$ for $u_i \sim \mathrm{Unif}(S^{K-1})$. Then consider the function $h: S^{K-1} \to \mathbb{R}$ defined by $h(u) = u'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'u$. Conditional on $\mathcal{S}$, $h$ is Lipschitz with respect to the standard metric on $S^{K-1}$. To see this, note that $|h(u) - h(w)|/d(u, w)$ achieves its maximum with $u$ the left singular vector corresponding to the largest singular value and $w$ the left singular vector corresponding to the smallest singular value; in this case, $|h(u) - h(w)| = |s_{\max}^2/(s_{\max}^2 + \lambda K/n) - s_{\min}^2/(s_{\min}^2 + \lambda K/n)|$. It follows that $P_{\mathrm{Unif}(S^{K-1})}(|h(u) - E_{\mathrm{Unif}(S^{K-1})}h(u)| > t) \leq C_{\mathcal{S}}\exp(-C_{\mathcal{S}}'nt^2)$ for constants $C_{\mathcal{S}}, C_{\mathcal{S}}'$ that depend on $\mathcal{S}$ only through $s_{\max}$.

Define the event $\mathcal{A}_0 = \{s_{\max}\sqrt{n/K} \leq 1 + \sqrt{n/K} + 1\}$. Taking $t = n^{-1/4}$, we obtain $P(|P^\Lambda_{ii} - E[P^\Lambda_{ii}]| > n^{-1/4}\mid\mathcal{A}_0) \leq C\exp(-Cn^{1/2})$, the two occurrences of $C$ signifying possibly different constants. In addition, by the union bound, $P(\max_i|P^\Lambda_{ii} - E[P^\Lambda_{ii}]| > n^{-1/4}\mid\mathcal{A}_0) \leq nC\exp(-Cn^{1/2})$.

Define the event $\mathcal{A}_1 = \{\max_i|P^\Lambda_{ii} - E[P^\Lambda_{ii}]| \leq n^{-1/4}\}\cap\mathcal{A}_0$. Then, conditional on $\mathcal{A}_1$, each diagonal entry of $(I_n - \mathrm{diag}(P^\Lambda))^{-1}\mathrm{diag}(P^\Lambda)$ lies between $(E[P^\Lambda_{ii}] - n^{-1/4})/(1 - (E[P^\Lambda_{ii}] - n^{-1/4}))$ and $(E[P^\Lambda_{ii}] + n^{-1/4})/(1 - (E[P^\Lambda_{ii}] + n^{-1/4}))$, so that
$$\frac{E[P^\Lambda_{ii}] - n^{-1/4}}{1 - (E[P^\Lambda_{ii}] - n^{-1/4})}\,\upsilon'\upsilon/n\ \leq\ \upsilon'(I_n - \mathrm{diag}(P^\Lambda))^{-1}\mathrm{diag}(P^\Lambda)\upsilon/n\ \leq\ \frac{E[P^\Lambda_{ii}] + n^{-1/4}}{1 - (E[P^\Lambda_{ii}] + n^{-1/4})}\,\upsilon'\upsilon/n.$$
Similarly, conditional on $\mathcal{A}_1$,
$$\frac{\upsilon'P^\Lambda\upsilon/n}{1 - (E[P^\Lambda_{ii}] - n^{-1/4})}\ \leq\ \upsilon'(I_n - \mathrm{diag}(P^\Lambda))^{-1}P^\Lambda\upsilon/n\ \leq\ \frac{\upsilon'P^\Lambda\upsilon/n}{1 - (E[P^\Lambda_{ii}] + n^{-1/4})}.$$

We next show that $\upsilon'\upsilon/n$ and $\upsilon'P^\Lambda\upsilon/n$ are concentrated around their means. First note that $\upsilon'P^\Lambda\upsilon/n = \Pi'\mathcal{V}\mathcal{S}'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'\mathcal{S}\mathcal{V}'\Pi$ when expressed using the singular value decomposition. This quantity has the same distribution for all $\Pi$ that satisfy $\|\Pi\| = c_0$. In particular, the distribution of $\upsilon'P^\Lambda\upsilon/n$ is the same as the distribution of the (1,1) element of $\mathcal{V}\mathcal{S}'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'\mathcal{S}\mathcal{V}'$ multiplied by $c_0^2$, which is of the form $c_0^2\,v'\mathcal{S}'\mathcal{S}(\mathcal{S}'\mathcal{S} + \lambda KI_K/n)^{-1}\mathcal{S}'\mathcal{S}v$ with $v$ uniformly distributed on the unit sphere. Using an argument similar to that bounding the diagonal entries of $P^\Lambda$, it follows that $P(|\upsilon'P^\Lambda\upsilon/n - E[\upsilon'P^\Lambda\upsilon/n]| > n^{-1/4}\mid\mathcal{A}_0) \leq C\exp(-Cn^{1/2})$. Similarly, $P(|\upsilon'\upsilon/n - E[\upsilon'\upsilon/n]| > n^{-1/4}\mid\mathcal{A}_0) \leq C\exp(-Cn^{1/2})$. Define the event $\mathcal{A}_2 = \{|\upsilon'P^\Lambda\upsilon/n - E[\upsilon'P^\Lambda\upsilon/n]| \leq n^{-1/4}\}\cap\{|\upsilon'\upsilon/n - E[\upsilon'\upsilon/n]| \leq n^{-1/4}\}\cap\mathcal{A}_0$.

Conditional on $\mathcal{A}_0\cap\mathcal{A}_1\cap\mathcal{A}_2$, substituting these concentration bounds into the two displays above sandwiches $\sum_{i=1}^n\upsilon_i\upsilon_i/n = \upsilon'(I_n - \mathrm{diag}(P^\Lambda))^{-1}(P^\Lambda - \mathrm{diag}(P^\Lambda))\upsilon/n$ between quantities converging to $E[\sum_{i=1}^n\upsilon_i\upsilon_i/n]$. It follows that, for $n$ sufficiently large, $\sum_{i=1}^n\upsilon_i\upsilon_i/n \geq E[\sum_{i=1}^n\upsilon_i\upsilon_i/n]/2 > 0$ on $\mathcal{A}_0\cap\mathcal{A}_1\cap\mathcal{A}_2$. The probabilities of the events $\mathcal{A}_0$, $\mathcal{A}_1$, and $\mathcal{A}_2$ are all bounded below by $1 - Cn\exp(-Cn^{1/2})$. Therefore, by a simple union bound and Borel-Cantelli, $P(\mathcal{A}_0^c\cup\mathcal{A}_1^c\cup\mathcal{A}_2^c\ \text{i.o.}) = 0$. This verifies Assumption 4.

An alternative proof of the validity of Assumption 4 can be carried out by explicitly calculating $E[(\upsilon'\upsilon/n - E[\upsilon'\upsilon/n])^4]$. Though this approach has the advantage that it does not lean on the Gaussian structure of the instruments to the extent of the preceding argument, it is much longer, and we therefore omit the details. $\square$
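The leverage bound used in verifying Assumption 1 can be illustrated numerically; the sketch below takes $\Lambda'\Lambda = \lambda KI_K$ as in the example, with parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, lam = 200, 100, 1.0
Z = rng.standard_normal((n, K))
P = Z @ np.linalg.inv(Z.T @ Z + lam * K * np.eye(K)) @ Z.T
smax = np.linalg.svd(Z / np.sqrt(n), compute_uv=False).max()
bound = smax ** 2 / (smax ** 2 + lam * K / n)
assert np.diag(P).max() <= bound + 1e-12   # max_i P_ii <= s^2/(s^2 + lam*K/n) < 1
```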


7.4. Derivation of Recursive Residual Expression. The recursive residual expression can be shown, for example, using the Sherman-Morrison-Woodbury formula. For the regression of $X$ on $Z$,
$$\begin{aligned}
\hat\Pi_{-i}'Z_i &= \big[(Z'Z - Z_iZ_i')^{-1}(Z'X - Z_iX_i')\big]'Z_i\\
&= \Big[\Big[(Z'Z)^{-1} + \frac{(Z'Z)^{-1}Z_iZ_i'(Z'Z)^{-1}}{1 - Z_i'(Z'Z)^{-1}Z_i}\Big](Z'X - Z_iX_i')\Big]'Z_i\\
&= \Big[\Big[\frac{1 - P_{ii}}{1 - P_{ii}}(Z'Z)^{-1} + \frac{(Z'Z)^{-1}Z_iZ_i'(Z'Z)^{-1}}{1 - P_{ii}}\Big](Z'X - Z_iX_i')\Big]'Z_i\\
&= \big[(X'Z(Z'Z)^{-1}Z_i - X_iZ_i'(Z'Z)^{-1}Z_i)(1 - P_{ii}) + X'Z(Z'Z)^{-1}Z_iZ_i'(Z'Z)^{-1}Z_i - X_iZ_i'(Z'Z)^{-1}Z_iZ_i'(Z'Z)^{-1}Z_i\big]/(1 - P_{ii})\\
&= \big[X'Z(Z'Z)^{-1}Z_i(1 - P_{ii}) - X_iP_{ii}(1 - P_{ii}) + X'Z(Z'Z)^{-1}Z_iP_{ii} - X_iP_{ii}^2\big]/(1 - P_{ii})\\
&= \big[X'Z(Z'Z)^{-1}Z_i - X_iP_{ii}\big]/(1 - P_{ii})
\end{aligned}$$
as desired.
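The identity can also be checked numerically by refitting the first stage with observation $i$ deleted; this small sketch (our own construction) compares the brute-force leave-one-out fit to the closed form derived above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, i = 30, 4, 7
Z = rng.standard_normal((n, K))
X = Z @ rng.standard_normal(K) + rng.standard_normal(n)
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
Z_d, X_d = np.delete(Z, i, axis=0), np.delete(X, i)
Pi_loo = np.linalg.solve(Z_d.T @ Z_d, Z_d.T @ X_d)    # drop obs i, refit
lhs = Pi_loo @ Z[i]
rhs = ((P @ X)[i] - X[i] * P[i, i]) / (1 - P[i, i])   # closed form
assert np.isclose(lhs, rhs)
```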

References

Amemiya, T. (1966): “On the Use of Principal Components of Independent Variables in Two-Stage Least-Squares

Estimation,” International Economic Review, 7, 283–303.

Amemiya, T. (1974): “The Non-linear Two-Stage Least Squares Estimator,” Journal of Econometrics, 2, 105–110.

Andrews, D. W. K., and J. H. Stock (2007): “Inference with Weak Instruments,” in Advances in Econom-

ics and Econometrics, Theory and Applications, 9th Congress of the Econometric Society, Volume 3, ed. by

R. Blundell, W. Newey, and T. Persson. Cambridge University Press.

Angrist, J. D., G. W. Imbens, and A. B. Krueger (1999): “Jackknife Instrumental Variables Estimation,”

Journal of Applied Econometrics, 14(1), 57–67.

Angrist, J. D., and A. B. Krueger (1991): “Does Compulsory School Attendance Affect Schooling and

Earnings?,” The Quarterly Journal of Economics, 106(4), 979–1014.

Bai, J., and S. Ng (2009): “Selecting Instrumental Variables in a Data Rich Environment,” Journal of Time

Series Econometrics, 1(1).

Bai, J., and S. Ng (2010): "Instrumental Variable Estimation in a Data Rich Environment," Econometric Theory, 26,

1577-1606.

Bekker, P. A. (1994): “Alternative Approximations to the Distributions of Instrumental Variables Estimators,”

Econometrica, 63, 657–681.

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2010): “Sparse Models and Methods for Optimal

Instruments with an Application to Eminent Domain,” forthcoming Econometrica.

Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009): “Simultaneous analysis of Lasso and Dantzig selector,”

Annals of Statistics, 37(4), 1705–1732.

Bound, J., D. A. Jaeger, and R. M. Baker (1995): “Problems with Instrumental Variables Estimation When

the Correlation Between the Instruments and the Endogenous Explanatory Variable is Weak,” Journal of the

American Statistical Association, 90(430), 443–450.

Bühlmann, P. (2006): "Boosting for high-dimensional linear models," Ann. Statist., 34(2), 559–583.

Caner, M. (2009): “LASSO-Type GMM Estimator,” Econometric Theory, 25, 270–290.

Page 43: Instrumental variables estimation with many weak instruments using regularized JIVE

42 CHRISTIAN HANSEN AND DAMIAN KOZBUR

Carrasco, M. (2012): “A Regularization Approach to the Many Instruments Problem,” forthcoming in Journal

of Econometrics.

Carrasco, M., and G. Tchuente Nguembu (2012): “Regularized LIML with Many Instruments,” Discussion

paper, University of Montreal Working paper.

Chamberlain, G. (1987): “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions,” Journal

of Econometrics, 34, 305–334.

Chamberlain, G., and G. Imbens (2004): “Random Effects Estimators with Many Instrumental Variables,”

Econometrica, 72, 295–306.

Chao, J., and N. Swanson (2005): “Consistent Estimation With a Large Number of Weak Instruments,”

Econometrica, 73, 1673–1692.

Chao, J. C., N. R. Swanson, J. A. Hausman, W. K. Newey, and T. Woutersen (2012): “Asymptotic

Distribution of JIVE in a Heteroskedastic IV Regression with Many Instruments,” Econometric Theory, 28(1),

42–86.

Davidson, K., and S. Szarek (2001): “Local Operator Theory, Random Matrices and Banach Spaces,” in

Handbook of the Geometry of Banach Spaces, ed. by W. B. Johnson, and J. Lindenstrauss, pp. 317–366. New

York: Elsevier.

Dicker, L. (2012): “Optimal Estimation and Prediction for Dense Signals in High-Dimensional Linear Models,”

ArXiv working paper.

Donald, S. G., and W. K. Newey (2001): “Choosing the Number of Instruments,” Econometrica, 69(5),

1161–1191.

Fuller, W. A. (1977): “Some Properties of a Modification of the Limited Information Estimator,” Econometrica,

45, 939–954.

Gautier, E., and A. B. Tsybakov (2011): “High-Dimensional Instrumental Variables Regression and Confi-

dence Sets,” ArXiv working report.

Hansen, C., J. Hausman, and W. K. Newey (2008): “Estimation with Many Instrumental Variables,” Journal

of Business and Economic Statistics, 26, 398–422.

Kapetanios, G., L. Khalaf, and M. Marcellino (2011): “Factor based identification-robust inference in IV

regressions,” working paper.

Kapetanios, G., and M. Marcellino (2010): “Factor-GMM estimation with large sets of possibly weak

instruments," Computational Statistics & Data Analysis, 54(11), 2655-2675.

Kloek, T., and L. Mennes (1960): “Simultaneous Equations Estimation Based on Principal Components of

Predetermined Variables,” Econometrica, 28, 45–61.

Newey, W. K. (1990): “Efficient Instrumental Variables Estimation of Nonlinear Models,” Econometrica, 58,

809–837.

Newey, W. K., and R. J. Smith (2004): “Higher Order Properties of GMM and Generalized Empirical Likeli-

hood Estimators,” Econometrica, 72(1), 219–255.

Okui, R. (2010): “Instrumental Variable Estimation in the Presence of Many Moment Conditions,” forthcoming

Journal of Econometrics.

Phillips, G. D. A., and C. Hale (1977): “The bias of instrumental variable estimators of simultaneous equation

systems,” International Economic Review, 18, 219–228.

Staiger, D., and J. H. Stock (1997): “Instrumental Variables Regression with Weak Instruments,” Econo-

metrica, 65, 557–586.

Stock, J. H., J. H. Wright, and M. Yogo (2002): “A Survey of Weak Instruments and Weak Identification

in Generalized Method of Moments,” Journal of Business and Economic Statistics, 20(4), 518–529.

Tibshirani, R. (1996): “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser. B, 58,

267–288.


Table 1. Simulation Results with Many Instruments, K = 95

                          Dense Signal                Sparse Signal
               Med. Bias    MAD    RP 5%    Med. Bias    MAD    RP 5%

A. Concentration Parameter = 30. Binary Instruments
RJIVE             0.045    0.096   0.079       0.119    0.251   0.073
Post-LASSO        0.078    0.078   0.028       0.158    0.158   0.042
LASSO-JIVE        0.081    0.081   0.859       0.229    0.229   0.871
Carrasco          0.078    0.078   0.864       0.215    0.215   0.861
2SLS              0.086    0.086   0.987       0.242    0.242   0.989
JIVE              0.070    0.122   0.089       0.197    0.311   0.081
RJIVE-CV          0.039    0.092   0.073       0.092    0.236   0.067

B. Concentration Parameter = 30. Gaussian Instruments
RJIVE             0.000    0.025   0.053      -0.002    0.089   0.065
Post-LASSO        0.041    0.041   0.287       0.055    0.064   0.147
LASSO-JIVE        0.048    0.048   0.757       0.153    0.153   0.788
Carrasco          0.029    0.029   0.381       0.103    0.103   0.441
2SLS              0.053    0.053   0.963       0.165    0.165   0.963
JIVE              0.015    0.049   0.078       0.051    0.150   0.084
RJIVE-CV          0.001    0.024   0.062       0.000    0.091   0.057

C. Concentration Parameter = 150. Binary Instruments
RJIVE             0.004    0.064   0.065       0.003    0.163   0.050
Post-LASSO        0.102    0.103   0.219       0.117    0.136   0.145
LASSO-JIVE        0.095    0.095   0.641       0.048    0.048   0.757
Carrasco          0.086    0.086   0.557       0.029    0.029   0.381
2SLS              0.111    0.111   0.873       0.053    0.053   0.963
JIVE              0.017    0.103   0.074       0.015    0.049   0.078
RJIVE-CV         -0.004    0.058   0.044      -0.001    0.162   0.049

D. Concentration Parameter = 150. Gaussian Instruments
RJIVE            -0.001    0.020   0.054      -0.008    0.062   0.046
Post-LASSO        0.035    0.035   0.310       0.021    0.052   0.075
LASSO-JIVE        0.038    0.038   0.395       0.134    0.134   0.498
Carrasco          0.020    0.023   0.157       0.070    0.075   0.177
2SLS              0.049    0.049   0.653       0.154    0.154   0.641
JIVE              0.001    0.028   0.051       0.005    0.091   0.050
RJIVE-CV         -0.002    0.020   0.049       0.004    0.064   0.057

Note: Results are based on 1500 simulation replications. We report median bias (Med. Bias), median absolute deviation (MAD), and the rejection frequency of a 5% level test (RP 5%) for seven different estimators: the RJIVE proposed in this paper (RJIVE); the Post-LASSO IV estimator of Belloni, Chen, Chernozhukov, and Hansen (2012) (Post-LASSO); an estimator that uses LASSO model selection with a small penalty level that ensures that instruments are chosen and then uses the selected instruments with the JIVE (LASSO-JIVE); the estimator of Carrasco (2012) (Carrasco); 2SLS; JIVE; and the proposed RJIVE estimator with penalty parameter chosen by cross-validation (RJIVE-CV). In all simulations, the correlation between the first-stage error and the structural error is set to 0.6.
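For concreteness, a minimal sketch of the RJIVE computation underlying the rows above follows, for the case of a single endogenous regressor and no exogenous covariates. A leave-one-out ridge regression supplies the first-stage fit for each observation, and the fitted values are then used as instruments. The function name rjive, the fixed scalar penalty lam, and the simplified setup are illustrative assumptions rather than the exact implementation used in the simulations.

    import numpy as np

    def rjive(y, x, Z, lam):
        # y: (n,) outcome; x: (n,) endogenous regressor; Z: (n, K) instruments;
        # lam: ridge penalty. Each observation's instrument is its own
        # leave-one-out, ridge-regularized first-stage fitted value, which
        # removes the own-observation bias that plagues 2SLS with many
        # instruments.
        n, K = Z.shape
        G = Z.T @ Z + lam * np.eye(K)        # full-sample penalized Gram matrix
        b = Z.T @ x
        zhat = np.empty(n)
        for i in range(n):
            # Down-date the full-sample moments to exclude observation i;
            # the penalty term is common to every leave-one-out fit.
            Gi = G - np.outer(Z[i], Z[i])
            bi = b - Z[i] * x[i]
            pi_i = np.linalg.solve(Gi, bi)   # leave-one-out ridge coefficients
            zhat[i] = Z[i] @ pi_i            # constructed instrument for obs i
        beta = (zhat @ y) / (zhat @ x)       # IV estimate using zhat
        return beta, zhat

Choosing lam on a grid by cross-validation, with the same leave-one-out structure, yields an estimator in the spirit of the RJIVE-CV rows reported in the tables.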


Table 2. Simulation Results with Many Instruments, K = 190

                          Dense Signal                Sparse Signal
               Med. Bias    MAD    RP 5%    Med. Bias    MAD    RP 5%

A. Concentration Parameter = 30. Binary Instruments
RJIVE             0.041    0.078   0.075       0.166    0.280   0.083
Post-LASSO        0.067    0.067   0.021       0.177    0.177   0.035
LASSO-JIVE        0.064    0.064   0.995       0.249    0.249   0.998
Carrasco          0.060    0.060   0.965       0.228    0.228   0.969
2SLS              0.064    0.064   0.992       0.248    0.248   0.995
JIVE              0.068    0.096   0.079       0.230    0.359   0.077
RJIVE-CV          0.045    0.075   0.088       0.157    0.292   0.062

B. Concentration Parameter = 30. Gaussian Instruments
RJIVE            -0.001    0.021   0.060       0.002    0.115   0.066
Post-LASSO        0.035    0.035   0.218       0.052    0.066   0.145
LASSO-JIVE        0.040    0.040   0.980       0.184    0.184   0.991
Carrasco          0.027    0.027   0.610       0.133    0.133   0.686
2SLS              0.039    0.039   0.973       0.178    0.178   0.973
JIVE              0.022    0.046   0.074       0.089    0.200   0.072
RJIVE-CV          0.002    0.020   0.084       0.012    0.114   0.070

C. Concentration Parameter = 150. Binary Instruments
RJIVE             0.006    0.058   0.065       0.019    0.205   0.059
Post-LASSO        0.083    0.083   0.199       0.119    0.139   0.161
LASSO-JIVE        0.088    0.088   0.953       0.347    0.347   0.965
Carrasco          0.072    0.072   0.751       0.275    0.275   0.730
2SLS              0.088    0.088   0.909       0.335    0.335   0.888
JIVE              0.063    0.127   0.055       0.179    0.427   0.055
RJIVE-CV          0.004    0.059   0.060       0.020    0.208   0.052

D. Concentration Parameter = 150. Gaussian Instruments
RJIVE            -0.001    0.016   0.054       0.000    0.074   0.049
Post-LASSO        0.031    0.031   0.425       0.025    0.054   0.069
LASSO-JIVE        0.041    0.041   0.803       0.194    0.194   0.847
Carrasco          0.021    0.021   0.266       0.107    0.107   0.307
2SLS              0.038    0.038   0.692       0.173    0.173   0.699
JIVE              0.002    0.029   0.057       0.015    0.141   0.050
RJIVE-CV          0.000    0.015   0.060      -0.001    0.074   0.046

Note: Results are based on 1500 simulation replications. We report median bias (Med. Bias), median absolute deviation (MAD), and the rejection frequency of a 5% level test (RP 5%) for seven different estimators: the RJIVE proposed in this paper (RJIVE); the Post-LASSO IV estimator of Belloni, Chen, Chernozhukov, and Hansen (2012) (Post-LASSO); an estimator that uses LASSO model selection with a small penalty level that ensures that instruments are chosen and then uses the selected instruments with the JIVE (LASSO-JIVE); the estimator of Carrasco (2012) (Carrasco); 2SLS; JIVE; and the proposed RJIVE estimator with penalty parameter chosen by cross-validation (RJIVE-CV). In all simulations, the correlation between the first-stage error and the structural error is set to 0.6.


Table 3. Estimates of the Return to Schooling in Angrist and Krueger Data

                               2SLS    Post-LASSO    JIVE     RJIVE

A. 3 Instruments
Schooling Coefficient        0.1079      0.1115     0.1091   0.1091
Estimated Standard Error     0.0196      0.0205     0.0202   0.0202

B. 180 Instruments
Schooling Coefficient        0.0928      0.1125     0.1096   0.1062
Estimated Standard Error     0.0097      0.0173     0.0161   0.0157

C. 1527 Instruments
Schooling Coefficient        0.0712      0.0862     0.0816   0.1067
Estimated Standard Error     0.0049      0.0254     0.5168   0.0171

Note: This table reports estimates of the returns-to-schooling parameter in the Angrist-Krueger 1991 data using different estimators and different numbers of instruments. In each panel, we give point estimates of the schooling coefficient and heteroskedasticity-consistent standard error estimates. We report results for 2SLS, the Post-LASSO estimator of Belloni, Chen, Chernozhukov, and Hansen (2012) (Post-LASSO), JIVE, and our regularized JIVE (RJIVE). Further details are provided in the text. For comparison, the OLS estimate (standard error) of the schooling coefficient is 0.0673 (0.0004).
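The heteroskedasticity-consistent standard errors reported in the table take the usual sandwich form for an IV estimator built from a constructed instrument. A minimal sketch for the scalar case follows, reusing the zhat array from the rjive sketch above; it ignores the exogenous covariates that the application partials out, so it is a simplification under stated assumptions rather than the exact computation behind the table.

    import numpy as np

    def iv_hc_se(y, x, zhat, beta):
        # Sandwich standard error for the scalar IV estimator
        # beta = (zhat'y) / (zhat'x).
        eps = y - x * beta                  # structural residuals
        meat = np.sum((zhat * eps) ** 2)    # sum_i zhat_i^2 * eps_i^2
        bread = zhat @ x                    # Jacobian of the moment condition
        return np.sqrt(meat) / abs(bread)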

