Analysis of Testing-Based Forward Model Selection
Damian Kozbur
University of Zurich
Department of Economics
Schonberggasse 1, 8001 Zurich
e-mail: [email protected].
Abstract: This paper analyzes a procedure called Testing-Based Forward Model Selection
(TBFMS) in linear regression problems. This procedure inductively selects covariates that
add predictive power into a working statistical model before estimating a final regression.
The criterion for deciding which covariate to include next and when to stop including
covariates is derived from a profile of traditional statistical hypothesis tests. This paper
proves probabilistic bounds, which depend on the quality of the tests, for prediction error
and the number of selected covariates. As an example, the bounds are then specialized to a
case with heteroskedastic data, with tests constructed with the help of Huber-Eicker-White
standard errors. Under the assumed regularity conditions, these tests lead to estimation
convergence rates matching other common high-dimensional estimators including Lasso.
MSC 2010 subject classifications: 62J05, 62J07, 62L12.
JEL subject classifications: C55.
Keywords and phrases: model selection, forward regression, sparsity, hypothesis testing.
∗ First version: November 1, 2015. This version April 7, 2020. I gratefully acknowledge helpful discussions
with Christian Hansen, Tim Conley, Kelly Reeve, Dan Zou, Nicolai Meinshausen, Marloes Maathius, Rahul
Mazumder, Ryan Tibshirani, Trevor Hastie, Martin Schonger, Pietro Biroli, Michael Wolf, Lorenzo Casaburi,
Hannes Schwandt, Ralph Ossa, Rainer Winkelmann, attendants at the ETH Zurich Seminar fur Statistik Re-
search Seminar, attendants at the Center for Law and Economics Internal Seminar, attendants at the Toulouse
School of Economics Econometrics Seminar, research assistants Vincent Lohmann and Alexandre Jenni for as-
sisting in several stages of verification of formal arguments, Yuming Pan for assistance with computation, as
well as financial support from UZH and from the ETH Fellowship program. The content of this paper draws
from two separate working papers posted on ArXiv (https://arxiv.org/abs/1702.01000 (Kozbur (2017a)) and
https://arxiv.org/abs/1512.02666 (the current paper)). The projects have been merged in preparation for the
publication process. A mechanical statement of a special case of TBFMS appears in Kozbur (2017b), which nei-
ther claims nor derives any theoretical results. This paper previously had the title “Testing-Based Forward Model
Selection,” which is also the title of Kozbur (2017b).
arXiv:1512.02666v13 [math.ST] 6 Apr 2020
Damian Kozbur/ATBFMS 1
1. Introduction
This paper analyzes a procedure called Testing-Based Forward Model Selection (TBFMS) for
high-dimensional econometric problems, which are characterized by settings in which the number
of observed characteristics per observation in the data is large.1 High-dimensional econometrics
is a leading area of current research because of recent rapid growth in data availability and
computing capacity, coupled with the important need to extract as much useful information
from data in a way that allows precise and rigorous testing of scientific hypotheses.
The primary settings of this paper are high-dimensional sparse linear regression models, in
which the number of covariates is allowed to be commensurate with or exceed the sample size. A
key challenge with a high-dimensional data set is that estimation requires dimension reduction
or regularization to avoid statistical overfitting. A sparsity assumption imposes that the regres-
sion function relating the outcome and the covariates can be approximated by a regression of
the outcome on a small, ex ante unknown subset of covariates. Under sparsity, there are sev-
eral consistent estimation procedures (further reviewed below) that work by enforcing that the
estimated regression function be sparse or small under an appropriate norm.
An appealing class of techniques for high-dimensional regression problems are greedy algo-
rithms. These are procedures that inductively select individual covariates into a working model
(i.e., a collection of covariates) until a stopping criterion is met. A linear regression restricted
to the final selected model is then estimated. A leading example is Simple Forward Selection,2
which chooses the covariate that gives the largest increase in in-sample R-squared over the
previous working model. This class of techniques is widely used because they are intuitive and
simple to implement. Such methods in the statistics literature date back to at least Efroymson
(1966).
In practice, deciding which covariate gives the best additional predictive power relative to a
working model is complicated by the fact that outcomes are observed with noise or are partly
idiosyncratic. For example, in linear regression, a variable associated with a positive increment of
in-sample R-squared upon inclusion may not add any predictive power out-of-sample. Statistical
hypothesis tests offer one way to determine whether a variable of interest is likely to improve
out-of-sample predictions. Furthermore, in many econometric and statistical applications, the
classical assumption of independent and identically distributed data is not always appropriate.
The availability of hypothesis tests for diverse classes of problems and settings motivates the
introduction of a testing-based strategy. Mechanically, TBFMS begins with an empty model.
The procedure then tests whether any covariates provide additional predictive capability in the
population. The selection stops when no tests return a significant covariate. Selection into the
1High-dimensional data may arise in several ways—data may be intrinsically high-dimensional with many
characteristics per observation, or alternatively, researchers may obtain a large final set of covariates through
forming interactions and transformations of underlying covariates.
2Simple Forward Selection is not standard nomenclature, but is used here in order to have a parallel language
with Testing-Based Forward Model Selection. The literature is varied and uses several names, including Forward
Regression and Forward Stepwise Regression. “Model” is used in the name of TBFMS to avoid confusion with
sample selection problems common in econometrics.
model is then based on the largest value of an associated test statistic. Note that in this context,
the hypothesis tests serve solely to assist model selection, not ex post inference.
There are several earlier analyses of Simple Forward Selection.3 Wang (2009) gives bounds
on the performance and number of selected covariates under a β-min condition which requires
the minimum magnitude of non-zero coefficients to be suitably bounded away from zero. Zhang
(2009) and Tropp (2004) prove performance bounds for greedy algorithms under a strong irrepre-
sentability condition, which restricts the empirical covariance matrix of the predictors. Das and
Kempe (2011) prove bounds on the relative performance in population R-squared of a forward
selection based model (relative to infeasible R-squared) when the number of variables allowed for
selection is fixed. In addition to Simple Forward Selection, there are several related procedures
in which estimation is done in stages. These include a method that is not strictly greedy called
Forward-Backward Selection, which proceeds similarly to Simple Forward Selection but allows
previously selected covariates to be discarded from the working model at certain steps (see Zhang
(2009)).
As a preliminary, this paper proves new bounds on the predictive performance and number
of selected covariates for Simple Forward Selection. The conditions required here are weaker
than those used in Zhang (2009) and Tropp (2004) and impose no β-min restrictions or irrepresentability.
The convergence rates here are most similar to the analysis of a Forward-Backward
Selection in Zhang (2011), but require markedly different analysis since there is no chance to
correct “over-selection mistakes.”
This paper then gives performance bounds for TBFMS which depend directly on the quality of
the profile of tests considered, as measured by five constants which characterize size and power.
The abstract results for TBFMS are used to derive asymptotic bounds for various sequences of
data-generating processes. As an example, concrete tests for heteroskedastic data constructed
from Huber-Eicker-White standard errors are used to construct t-tests and explicit rates of
convergence are calculated.
There are many other sensible approaches to high-dimensional estimation. An important and
common approach to generic high-dimensional estimation problems is the Lasso. The Lasso
minimizes a least squares criterion augmented with a penalty proportional to the ℓ1 norm of
the coefficient vector. For theoretical and simulation results for Lasso, see Frank and Friedman
(1993), Tibshirani (1996), Hastie et al. (2009), Candes and Tao (2007), Bai and Ng (2008), Bickel
et al. (2009), Huang et al. (2010), Buhlmann and van de Geer (2011), among many more. Other
related methods include boosting (see Freund and Schapire (1996), Buhlmann (2006), Luo and
Spindler (2017)), Least Angle Regression (see Efron et al. (2004)), Post-Lasso (see Belloni et al.
(2012)), and many others. A recent related paper, Chudik et al. (2018), considers a different
iterative model selection procedure which also involves using hypothesis tests. In Chudik et al.
(2018), in the first iteration, a marginal regression of the outcome on each potential covariate is
run. Once all marginal regressions are run, all significant covariates are included into a working
3TBFMS using different tests than those proposed here is natively programmed in some statistical software,
including SPSS, but has not previously been formally justified in high-dimensional settings.
model. Each subsequent iteration works similarly.
The asymptotic estimation rates calculated here for TBFMS, applied to a constructed profile
of tests for heteroskedastic data, match those standard for Lasso and Post-Lasso. Relative to the
analysis of asymptotic properties of Lasso and related high-dimensional estimation techniques,
analysis of TBFMS is complicated by the fact that the procedure is not the optimizer of a simple
objective function. As a result, the theory also departs from the literature on m-estimation in a
fundamental way.
A recent paper, Hastie et al. (2017), performs a systematic simulation analysis of statistical
and computational performance of simple forward selection as well as a few additional estimators,
including Lasso and best subset selection; see Bertsimas et al. (2016). The paper reports that in
regression models with higher signal-to-noise ratios, forward selection performs favorably relative
to Lasso, a finding consistent across many simulation settings.
This paper complements recent literature on sequential testing (see G’Sell et al. (2016), Li
and Barber (2017), Tibshirani et al. (2014), Fithian et al. (2015)). Sequential testing considers
hypothesis testing in stages, in which tests in later stages can depend on testing outcomes in
earlier stages. In various settings, properties like family-wise error rates of proposed testing
procedures can be controlled over such sequences of hypothesis tests. While the current paper
focuses on statistical properties of estimates after TBFMS given properties of the implemented
tests, future work may potentially combine the two types of problems.
In economic applications, models learned using formal model selection are often used in sub-
sequent estimation steps, with the final goal of learning a structural parameter of interest. One
example is the selection of instrumental variables for later use in a first-stage regression (see
Belloni et al. (2012)). Another example is the selection of a conditioning set to properly control
for omitted variables bias when there are many control variables (see Zhang and Zhang (2014),
Belloni et al. (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014)). Bounds
about the quality of the selected model are used to derive results about the quality of post-
model selection estimation and to guide subsequent inference. Such applications require a model
selection procedure with hybrid objectives: (1) produce a good fit, and (2) return a sparse set of
variables. This paper addresses both objectives by providing sparsity and fit bounds for TBFMS.
In terms of computing, one fast implementation of forward selection depends on what is
sometimes referred to as a “guided QR decomposition.” Formally, simple forward selection can
be computed in O(npk) flops, with n being sample size, p being number of covariates, and k
being number of steps (see for example Hastie et al. (2017)), and requires the storage of the QR
decomposition of at most k variables. The version of TBFMS presented in the paper for data
with heteroskedastic disturbances can be computed with the same order of time and storage
requirements.4
4A modification of the least angle regression algorithm can be made to implement a similarly efficient
computation of Lasso (see Efron et al. (2004)), though this may require more iterations, as covariates can enter
and exit a suitably defined active set multiple times before it terminates as the selected set.
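The O(npk) computation described above can be sketched by incremental orthogonalization, which is the effect of the guided QR decomposition: after each selection, the remaining columns and the outcome are residualized against the entering covariate, so every later step costs O(np). A minimal Python sketch, assuming NumPy; a fixed number of steps k stands in for the stopping rule, and all names are illustrative:

```python
import numpy as np

def forward_selection_via_orthogonalization(X, y, k):
    """Run k steps of forward selection in O(n p k) flops by maintaining
    residualized copies of the covariates and outcome (one Gram-Schmidt
    step per selection, as in a guided QR decomposition)."""
    n, p = X.shape
    Xr = X.astype(float).copy()
    yr = y.astype(float).copy()
    S = []
    for _ in range(k):
        ss = (Xr ** 2).sum(axis=0)
        ss[S] = np.inf                 # already-selected columns are ~0; exclude them
        gains = (Xr.T @ yr) ** 2 / ss  # in-sample RSS reduction from adding each column
        j = int(np.argmax(gains))
        S.append(j)
        q = Xr[:, j] / np.linalg.norm(Xr[:, j])   # O(n p) orthogonalization step:
        Xr -= np.outer(q, q @ Xr)                 # residualize remaining columns
        yr -= q * (q @ yr)                        # and the outcome against q
    return S
```

Each iteration touches every column once, so the total cost over k steps is O(npk), and only the k orthogonalized directions need to be stored.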
2. Precursor: Sharp Convergence Rates for Simple Forward Selection without
β-min or Irrepresentability Conditions
This section proves a precursory result about Simple Forward Selection which is new in the
high-dimensional econometrics and statistics literature. The procedure is defined formally below
and is similar to TBFMS, but uses a single threshold rather than a profile of hypothesis tests in
determining the selection of covariates. The framework set out in this section is also helpful in
terms of outlining minimal structure needed to facilitate the method of analysis in the formal
arguments that follow.
2.1. Framework
A realization of data of sample size n is given by Dn = (xi, yi)i=1,...,n and is generated by a joint
distribution P. The data consist of a set of covariates xi ∈ Rp, as well as outcome variables
yi ∈ R for each observation i = 1, ..., n. The data satisfy
yi = x′iθ0 + εi
for some unknown parameter of interest θ0 ∈ Rp and unobserved disturbance terms εi ∈ R.
The parameter θ0 is sparse in the sense that the set of non-zero components of θ0, denoted
S0 = supp(θ0), has cardinality s0 < n. (Below, exact orthogonality between εi and xi will not be
required; rather, a notion of approximate orthogonality will be used. As a result, the framework
also handles notions of approximate sparsity.)
Define an empirical loss function ℓ(θ) by
ℓ(θ) = En[(yi − x′iθ)²],
where En[·] = (1/n) Σi=1..n (·) denotes empirical expectation. Note that ℓ(θ) depends on Dn, but
this dependence is suppressed from the notation. Define also, for subsets S ⊆ {1, ..., p},
ℓ(S) = min_{θ: supp(θ)⊆S} ℓ(θ).
The estimation strategy proceeds by first searching for a sparse subset S ⊆ {1, ..., p}, with
cardinality s, that assumes a small value of ℓ(S), followed by estimating θ0 with least squares
via
θ ∈ arg min_{θ: supp(θ)⊆S} ℓ(θ).
This gives the construction of the estimates x′iθ for i = 1, ..., n.
The set S is selected as follows. For any S, define the incremental loss from the jth covariate
by
∆jℓ(S) = ℓ(S ∪ {j}) − ℓ(S).
Consider the greedy algorithm which inductively selects the jth covariate to enter a working
model if −∆jℓ(S) exceeds a threshold t:
−∆jℓ(S) > t
and −∆jℓ(S) ≥ −∆kℓ(S) for each k ≠ j. The threshold t is chosen by the user; it is the only
tuning parameter required. This procedure is summarized formally here.
Algorithm 1. Simple Forward Regression
Initialize: Set S = ∅.
For 1 ≤ k ≤ p:
  If −∆jℓ(S) > t for some j ∈ {1, ..., p} \ S:
    Set j ∈ arg max{−∆jℓ(S) : −∆jℓ(S) > t}; update S = S ∪ {j}.
  Else: Break.
Set θ ∈ arg min_{θ: supp(θ)⊆S} ℓ(θ).
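Algorithm 1 admits a direct, if naive, implementation. The following Python sketch (assuming NumPy; the brute-force refitting at each candidate is for clarity, not the faster guided-QR computation mentioned in the introduction) illustrates the selection rule; all names are illustrative:

```python
import numpy as np

def simple_forward_selection(X, y, t):
    """Simple Forward Regression (Algorithm 1): repeatedly add the covariate
    with the largest decrease in empirical loss ell(S), stopping once no
    candidate decreases the loss by more than the threshold t."""
    n, p = X.shape

    def loss(cols):
        # ell(S) = min over supp(theta) in S of En[(y_i - x_i' theta)^2]
        if not cols:
            return float(np.mean(y ** 2))
        Z = X[:, cols]
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return float(np.mean((y - Z @ coef) ** 2))

    S = []
    while len(S) < p:
        base = loss(S)
        # -Delta_j ell(S) = ell(S) - ell(S u {j}) for each candidate j
        gains = {j: base - loss(S + [j]) for j in range(p) if j not in S}
        j_star = max(gains, key=gains.get)
        if gains[j_star] > t:
            S.append(j_star)
        else:
            break  # no covariate clears the threshold
    theta_hat = np.zeros(p)
    if S:
        theta_hat[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
    return S, theta_hat
```

The threshold t is the single tuning parameter, exactly as in the text; choosing it is the question taken up in Section 3.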
2.2. Formal Analysis
In order to analyze Algorithm 1 and state the first theorem, a few more definitions are convenient.
Define the empirical Gram matrix G by G = En[xix′i]. Let ϕmin(s)(G) denote the minimal s-sparse
eigenvalue, given by
ϕmin(s)(G) = min_{S⊆{1,...,p}: |S|≤s} λmin(GS),
where GS is the principal submatrix of G corresponding to the component set S. The maximal
sparse eigenvalues ϕmax(s)(G) are defined analogously. Let
cF(s) = (s + s0)^{1/2} ϕmin(s + s0)(G)^{-1/2} [2‖En[xiεi]‖∞ + t^{1/2}].
Finally, for each positive integer m, let
c′F(m) = 80 × ϕmin(m + s0)(G)^{-4}.
Theorem 1. Consider a data set Dn of fixed sample size n with parameter θ0. Suppose the
normalizations En[x²ij] = 1 hold for each j ≤ p. Then under Algorithm 1 with threshold t,
En[(x′iθ0 − x′iθ)²]^{1/2} ≤ cF(s).
In addition, for every integer m > 0 with m ≤ |S \ S0| such that t^{1/2} > 2ϕmin(m +
s0)(G)^{-1}‖En[xiεi]‖∞, it holds that
m ≤ c′F(m)s0.
The above theorem calculates explicit finite sample constants bounding the prediction error
norm. The second statement is a tool for bounding the number of selected covariates. In
particular, setting m∗ = min{m : m > c′F(m)s0} implies that
s < m∗ + s0
provided that the condition on m∗ given by t^{1/2} > 2ϕmin(m∗ + s0)(G)^{-1}‖En[xiεi]‖∞ is met.
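To make the role of m∗ concrete, consider a worked instance under an illustrative assumption (not taken from the text) of a uniform lower bound on the sparse eigenvalues:

```latex
% Illustrative assumption: \varphi_{\min}(m+s_0)(G) \ge 1/2 for all m under
% consideration, so that \varphi_{\min}(m+s_0)(G)^{-4} \le 16. Then
c'_F(m) \le 80 \times 16 = 1280
\quad\Longrightarrow\quad
m^* = \min\{m : m > c'_F(m)\, s_0\} \le 1280\, s_0 + 1
\quad\Longrightarrow\quad
s < m^* + s_0 \le 1281\, s_0 + 1,
% provided t^{1/2} > 2\varphi_{\min}(m^*+s_0)(G)^{-1}\|E_n[x_i\varepsilon_i]\|_\infty,
% for which t^{1/2} \ge 4\,\|E_n[x_i\varepsilon_i]\|_\infty suffices.
```

That is, under this assumption the number of selected covariates is at most a fixed multiple of the true sparsity s0.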
The statement in Theorem 1 gives finite sample bounds which are completely deterministic
in the sense that they hold for every possible realization of the data. Furthermore, the proof
does not use any random nature of Dn at any step. As a result, the bounds are very general,
but it is helpful for interpretation to consider the convergence rates implied by Theorem 1 under
asymptotic conditions on Dn. Consider a sequence of random data sets (Dn)n∈N generated by
joint distributions (P = Pn)n∈N. For each n, the data again satisfy yi = x′iθ0 + εi. In what
follows, the parameters θ0, the thresholds t, distribution P, the dimension p of xi, etc. can all
change with n.
Condition 1 (Asymptotic Regularity). The sparsity satisfies s0 = o(n). There is a sequence Kn
for which s0 = o(Kn), and there is a bound ϕmin(Kn)(G)^{-1} = O(1) which holds with probability
1 − o(1). The normalizations En[x²ij] = 1 hold a.s. for every j ≤ p. The threshold satisfies a bound
t = O(log p/n). In addition, t^{1/2} > 2ϕmin(Kn)(G)^{-1}‖En[xiεi]‖∞ with probability 1 − o(1).5
The rates assumed in Condition 1 reflect typical rates achieved under various possible sets of
low-level conditions standard in the literature (e.g., Belloni et al. (2012)). Condition 1 asserts three
important statements. The first statement bounds the size of S0 and requires that the sparsity
level is small relative to the sample size. The second statement is a sparse eigenvalue condition
useful for proving results about high-dimensional techniques. In standard regression analysis
where the number of covariates is small relative to the sample size, a common assumption used
in establishing properties of conventional estimators of θ is that G has full rank. In the high-
dimensional setting, G will be singular if p > n and may have an ill-behaved inverse even when
p 6 n. However, good performance of many high-dimensional estimators only requires good
behavior of certain moduli of continuity of G. There are multiple formalizations and moduli of
continuity that can be considered here; see Bickel et al. (2009). This analysis focuses on a simple
eigenvalue condition that was used in Belloni et al. (2012). Condition 1 could be shown to hold
under more primitive conditions by adapting arguments found in Belloni and Chernozhukov
(2013), which build upon results in Zhang and Huang (2008) and Rudelson and Vershynin
(2008); see also Rudelson and Zhou (2013). Condition 1 is notably weaker than previously used
irrepresentability conditions. Irrepresentability conditions require that for certain sets S and
k ∉ S, letting xiS be the subvector of xi with components j ∈ S, ‖En[xiS x′iS]^{-1} En[xiS xik]‖1 is
strictly less than 1. The normalization En[x²ij] = 1 is used to keep exposition concise and can be
5Formally, for a sequence of random variables Xn, the statement “Xn = O(1) with probability 1 − o(1)” is
defined as: “there is a constant C independent of n such that P(|Xn| > C)→ 0.”
relaxed (and, e.g., is relaxed in Theorem 5).
The final statement in Condition 1 is a regularization condition similar to regularization
conditions common in the analysis of Lasso. The condition requires t^{1/2} to dominate a multiple of
‖En[xiεi]‖∞. This condition is stronger than that typically encountered with Lasso, because
the multiple depends on the sparse eigenvalues of G. To illustrate why such a condition is useful,
let x̃ij denote xij residualized away from previously selected regressors and renormalized. Then
even if En[xijεi] < t^{1/2}, En[x̃ijεi] can exceed t^{1/2}, resulting in more selections into the model.
Nevertheless, using the multiple 2ϕmin(Kn)(G)^{-1}, which stays bounded with n, is sufficient to
ensure that s does not grow faster than s0. Furthermore, this requirement does not implicitly
impose a β-min condition and does not implicitly impose irrepresentability. The requirements
on t can be relaxed if there is additional control on quantities of the form En[x̃ijεi]. Relative
to analogous Lasso bounds in Belloni et al. (2012), Theorem 1 does not involve maximal sparse
eigenvalues. This may become relevant if the components of xi arise from factor model structures.
From a practical standpoint, Condition 1 does, however, implicitly require the user to know
more about the design of the data in choosing an appropriate t. Choosing feasible thresholds
which satisfy a similar condition to Condition 1 is considered in the next section, where analysis
of TBFMS is developed.
Theorem 2. Consider a sequence of data sets Dn indexed by n with parameters θ0 and thresholds
t which satisfy Condition 1. Suppose θ is obtained by Algorithm 1. Then there are bounds
En[(x′iθ0 − x′iθ)²]^{1/2} = O(√(s0 log p / n)),
s ≤ O(s0),
which hold with probability 1 − o(1) as n → ∞.
More explicitly, the implied O constants and o sequence in bounds for Theorem 2 are under-
stood to depend only on the implied O constants and o sequences in Condition 1.
The theorem shows that Simple Forward Selection can obtain asymptotically the same
convergence rates (specifically, √(s0 log p / n) for the quantities En[(x′iθ0 − x′iθ)²]^{1/2}) as other
high-dimensional estimators like Lasso, provided an appropriate threshold t is used. In addition, it
selects a set with cardinality commensurate with s0.
Finally, two direct consequences of Theorem 2 are bounds on the deviations ‖θ − θ0‖1 and
‖θ − θ0‖2 of θ from the underlying unknown parameter θ0. Theorem 3 below shows that deviations
of θ from θ0 also achieve rates typically encountered in high-dimensional estimators like Lasso.
Theorem 3. Consider a sequence of data sets Dn with parameters θ0 and thresholds t which
satisfy Condition 1. Suppose θ is obtained by Algorithm 1. Then there are bounds
‖θ0 − θ‖2 = O(√(s0 log p / n)) and ‖θ0 − θ‖1 = O(√(s0² log p / n)),
which hold with probability 1 − o(1) as n → ∞.
3. Testing-Based Forward Model Selection
The previous section presented results on convergence rates of Simple Forward Selection. The
results of Theorem 1 are useful in developing intuition and proof techniques for inductive variable
selection algorithms. However, in terms of practical implementation, Section 2 leaves the question
of how to choose a threshold unanswered. This section develops TBFMS in order to analyze
feasible, data-driven ways to decide which covariates to select, and when to stop selecting.
3.1. Framework
The basic framework for this section is similar to the earlier one. Again, the observed data are
given by Dn = (xi, yi)i=1,...,n, are generated by P, and satisfy yi = x′iθ0 + εi for a parameter θ0
which is sparse with s0 non-zero components supported on S0. Define ℓ(θ) and ℓ(S) as before.
Define the expected loss function E : Rp → R by
E(θ) = E[En[(yi − x′iθ)²]],
where E is expectation with respect to P. Note that E(θ) = Eℓ(θ). Extend the definition of E to
apply also as a map E : 2^{1,...,p} → R by E(S) = min_{θ: supp(θ)⊆S} E(θ). Similarly to before, for any
S, define the incremental loss from the jth covariate by
∆jE(S) = E(S ∪ {j}) − E(S).
Within the class of greedy algorithms, it would be preferable to consider a greedy algorithm
which inductively selects the jth covariate to enter a working model if ∆jE(S) is large in absolute
value and −∆jE(S) ≥ −∆kE(S) for each k ≠ j. However, because ∆jE(S) cannot generally be
directly observed from the data, the idea that follows is to make use of statistical tests to gauge
the magnitude of ∆jE(S). Consider a set of tests given by
TjSα ∈ {0, 1} associated to H0 : ∆jE(S) = 0 and level α > 0.
Assume that the tests reject (TjSα = 1) for large values of a test statistic WjS.
The model selection procedure is as follows. Start with an empty model (consisting of no
covariates). At each step, if the current model is S, select one covariate such that TjSα = 1,
append it to S, and continue to the next step; if no covariates have TjSα = 1, then terminate
the model selection procedure and return the current model. If at any juncture, there are two
indices j, k (or more) such that TjSα = TkSα = 1, the selection is made according to the larger
value of WjS ,WkS .
The use of TjSα to define the set of covariates eligible for entry into the model, and WjS
to select which eligible covariate actually enters is conceptually important: it dissociates and
highlights the two fundamental tasks of regularization and fitting.
To summarize, the algorithm for forward selection given the hypothesis tests (TjSα,WjS) is
now given formally.
Algorithm 2. Testing-Based Forward Model Selection
Initialize: Set S = ∅.
For 1 ≤ k ≤ p:
  If TjSα = 1 for some j ∉ S:
    Set j ∈ arg max{WjS : TjSα = 1}; update S = S ∪ {j}.
  Else: Break.
Set θ ∈ arg min_{θ: supp(θ)⊆S} En[(yi − x′iθ)²].
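A minimal Python sketch of Algorithm 2, assuming NumPy, with the profile of tests supplied as a callable `test(j, S)` returning the pair (TjSα, WjS); the callable and all names are illustrative, not part of the paper's notation:

```python
import numpy as np

def tbfms(X, y, test):
    """Testing-Based Forward Model Selection (Algorithm 2).

    `test(j, S)` returns (T, W): T = 1 if the test rejects
    H0: Delta_j E(S) = 0 at the chosen level, and W is the associated
    test statistic used to rank the covariates eligible for entry."""
    n, p = X.shape
    S = []
    while len(S) < p:
        results = {j: test(j, S) for j in range(p) if j not in S}
        eligible = {j: W for j, (T, W) in results.items() if T == 1}
        if not eligible:
            break  # no test rejects: terminate the selection
        S.append(max(eligible, key=eligible.get))  # largest W among eligible
    theta_hat = np.zeros(p)
    if S:
        theta_hat[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
    return S, theta_hat
```

The separation is exactly the one emphasized in the text: the indicator T regularizes (decides eligibility and stopping), while the statistic W does the fitting (ranks eligible covariates).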
3.2. Formal Analysis
This section formally states conditions on the hypothesis tests and conditions on the data before
analyzing properties of Algorithm 2. These conditions are measures of the quality of the given
testing procedure and the regularity of the data.
Condition 2 (Hypothesis Tests). There is an integer Ktest > s0 and constants α, δtest, ctest,
c′test, c′′test > 0 such that each of the following conditions holds.
1. The tests have power in the sense that
P(TjSα = 1 for every j, |S| ≤ Ktest such that −∆jE(S) > ctest) ≥ 1 − (1/3)δtest.
2. The tests control size in the sense that
P(TjSα = 1 for some j, |S| ≤ Ktest such that −∆jE(S) ≤ c′test) ≤ α + (1/3)δtest.
3. The tests are continuous in the sense that
P(−∆jE(S) ≥ −c′′test ∆kE(S) for each j, k, |S| ≤ Ktest such that
TjSα = 1 and WjS ≥ WkS) ≥ 1 − (1/3)δtest.
The constants ctest and c′test measure quantities related to the size and power of the tests and
provide a convenient language for subsequent discussion. The constant c′′test measures the extent
to which the test statistics WjS reflect the actual magnitude of ∆jE(S). Note again that the
hypothesis tests are considered simply tools for model selection, which coincidentally have many
properties in common with traditional inferential hypothesis tests.
Condition 3 (Regularity). Normalizations E[En[x²ij]] = 1 hold for all j. The residuals decompose
into εi = εoi + εai, where E[En[(εoi)²]] < ∞, E[En[εoi xij]] = 0 for all j, and E[En[(εai)²]] ≤
(1/2) ϕmin(Ktest)(E[G])^{-1} c′test. Finally, (80 × ϕmin(Ktest)(E[G])^{-4} c′′test^{-3} + 1) s0 < Ktest.
Condition 3 imposes regularity conditions for the class of models considered in the follow-
ing theorem. First, εi is decomposed into an orthogonal component εoi and an approximation
component εai , each of which exhibits a different kind of regularity. The orthogonal compo-
nent is orthogonal to the covariates in the population. The approximation component need not
be orthogonal to the covariates, but its magnitude must be suitably controlled by the sparse
eigenvalues of E[G] and by the parameter c′test, which is a detection threshold for the profile of
hypothesis tests TjSα. This decomposition allows for approximately sparse models similar to the
framework of Belloni et al. (2012). The fact that εai need not be orthogonal to the covariates also
allows this framework to overlay onto many problems in traditional nonparametric econometrics.
Condition 3 also imposes conditions on the relative values of the sparse eigenvalues of E[G],
c′′test, s0, and Ktest. Note that Ktest measures the size of the sets S ⊂ {1, ..., p} over which the
hypothesis tests perform well, as defined by Condition 2. Consequently, this condition requires
that the hypothesis tests TjSα perform sufficiently well over sets S, which must be larger when
E[G] has small eigenvalues, when c′′test is small, or when s0 is large.
There are a few cases where Condition 3 can be simplified. If p > n, even though the empirical
Gram matrix is necessarily rank deficient, the population Gram matrix may be full rank. When
E[G] is full rank, then λmin(E[G]) may be used in place of ϕmin(Ktest)(E[G]). In addition, the
condition on εai implicitly imposes constraints on c′test and ϕmin(Ktest)(E[G])−1. When there is
no approximation error, this requirement is no longer needed.
Let
cT = s0 ϕmin(Ktest)(E[G])^{-1} ctest,
c′T = 80 × ϕmin(Ktest)(E[G])^{-4} c′′test^{-3},
c′′T(s) = ϕmax(s + s0)(G)^{1/2} ϕmin(s + s0)(G)^{-1/2} s^{1/2} ‖En[xiεi]‖∞
  + 3 ϕmax(s + s0)(G) ϕmin(s + s0)(G)^{-1/2} (s + s0)^{1/2} ctest^{1/2} ϕmin(Ktest)(E[G])^{-1}.
Theorem 4. Consider Dn ∼ P for some fixed n and tests (TjSα, WjS) such that Conditions 2 and
3 hold. Suppose θ is obtained by Algorithm 2. Then the bounds
E(S) − E(S0) ≤ cT,
s ≤ (c′T + 1) s0,
En[(x′iθ0 − x′iθ)²]^{1/2} ≤ c′′T(s)
hold with probability at least 1 − α − δtest.
Theorem 4 provides finite sample bounds on the performance of TBFMS. In contrast to
Theorem 1, Theorem 4 also addresses the possibility that if covariate j is selected ahead of
covariate k, it is not necessarily the case that −∆jE(S) > −∆kE(S). This is done by making use
of the continuity constant c′′test in Condition 2.
Theorem 4 can be used to derive asymptotic estimation rates by allowing the constants to
change with n. The next subsection provides an example for a linear model with heteroskedastic
disturbances, where, under the stated regularity conditions, the prediction and estimation errors
attain the rate O(√(s0 log p/n)). This convergence rate matches typical Lasso and Post-Lasso rates.
The results aim to control the hybrid objectives, described in the introduction, of producing a
good fit and returning a sparse set of variables. One useful implication of bounds controlling both
s and En[(x′iθ0 − x′iθ)²] is that the results can be applied to constructing uniformly valid post-
model selection inference procedures (see Belloni et al. (2014)), in which, for some applications,
the prediction error bound alone is insufficient.
4. Examples and Extensions
This section describes an example application of Theorem 4. The main theoretical application
is an illustration with heteroskedastic data. For this setting, a TBFMS procedure is constructed
for which optimal convergence rates are proven.
4.1. Heteroskedastic Disturbances
This section gives an example of the use of Theorem 4 by illustrating an application of model
selection in the presence of heteroskedasticity in the disturbance terms εi. A TBFMS procedure
is constructed based on the Heteroskedasticity-Consistent standard errors described in White
(1980). The conditions required for the application of Theorem 4 are verified under low-level
conditions on data generating processes. Other TBFMS procedures are possible, and these are
discussed in the next section. The analysis begins with a formulation stated in Kozbur (2017b)
(which neither derives nor claims any theoretical properties).
For shorthand, write xijS (with j /∈ S) to be the vector with components xik with k = j or
k ∈ S. To construct the tests, begin with the least squares estimate of the regression yi on xijS .
θjS = En[xijS x′ijS]−1 En[xijS yi].
Define εijS = yi − x′ijS θjS. One heteroskedasticity-robust estimate of the sampling variance of
θjS, proposed in White (1980), is given by the expression

VjS = (1/n) En[xijS x′ijS]−1 ΨεjS En[xijS x′ijS]−1
where
ΨεjS = En[ε2ijS xijS x′ijS].
Define the test statistics

WhetjS = ([VjS]jj)−1/2 |[θjS]j|.
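As a concrete illustration, the least squares fit, the White (1980) sandwich variance, and the resulting statistic WhetjS can be sketched as follows (an illustrative numpy sketch, not the paper's code; the function name and interface are hypothetical).

```python
import numpy as np

def w_het(X_jS, y, j_pos=0):
    """Sketch of W^het_{jS}: least squares of y on x_{ijS}, White sandwich
    variance, then the studentized coefficient magnitude. X_jS is the
    (n, 1+|S|) matrix of candidate plus selected covariates; j_pos is the
    column position of the candidate covariate x_ij."""
    n = X_jS.shape[0]
    G = X_jS.T @ X_jS / n                          # En[x_{ijS} x'_{ijS}]
    theta = np.linalg.solve(G, X_jS.T @ y / n)     # theta_{jS}
    eps = y - X_jS @ theta                         # eps_{ijS}
    Psi = (X_jS * eps[:, None] ** 2).T @ X_jS / n  # Psi^eps_{jS}
    Ginv = np.linalg.inv(G)
    V = Ginv @ Psi @ Ginv / n                      # V_{jS}, White (1980)
    return abs(theta[j_pos]) / np.sqrt(V[j_pos, j_pos])
```

A strong signal yields a large statistic, while a null covariate yields a statistic of moderate size.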
Reject H0 for large values of W hetjS defined relative to an appropriately chosen threshold. To
define the threshold, first let ηjS := (1, −β′jS)′ where βjS is the coefficient vector from the least
squares regression of (xij)ni=1 on (xik)ni=1, k ∈ S. Then define

τjS = ‖η′jS Diag(ΨεjS)1/2‖1 / √(η′jS ΨεjS ηjS).
The τjS will be helpful in addressing the fact that many different model selection paths are
possible under different realizations of the data under P.6 Not taking this fact into account can
potentially lead to false discoveries. The next definition states precisely the hypothesis tests TjSα.
Definition 1 (Hypothesis Tests for Heteroskedastic Disturbances). Let cτ > 1 and α > 0 be
parameters. Assign WjS = W hetjS . Assign
TjSα = 1 ⇐⇒ WjS > cτ τjSΦ−1(1− α/p).
The term Φ−1(1 − α/p) can be informally thought of as a Bonferroni correction term that
takes into account the fact that there are p potential covariates. The term cτ τjS can be informally
thought of as a correction term that can account for the fact that the set S is random and can
have many potential realizations. The simulation study uses the settings cτ = 1.01 and α = .05.
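Putting these pieces together, the Definition 1 decision rule can be sketched as follows (an illustrative sketch, not the paper's code; `definition1_test` and its interface are hypothetical, and the stdlib normal quantile stands in for a statistical library).

```python
import numpy as np
from statistics import NormalDist

def definition1_test(X_jS, y, j_pos, p, c_tau=1.01, alpha=0.05):
    """Sketch of the Definition 1 test: reject H0 when
    W^het_{jS} > c_tau * tau_{jS} * Phi^{-1}(1 - alpha/p)."""
    n = X_jS.shape[0]
    G = X_jS.T @ X_jS / n
    theta = np.linalg.solve(G, X_jS.T @ y / n)
    eps = y - X_jS @ theta
    Psi = (X_jS * eps[:, None] ** 2).T @ X_jS / n      # Psi^eps_{jS}
    Ginv = np.linalg.inv(G)
    V = Ginv @ Psi @ Ginv / n
    W = abs(theta[j_pos]) / np.sqrt(V[j_pos, j_pos])   # W^het_{jS}
    # eta_{jS} = (1, -beta'_{jS})': minus the slopes from regressing the
    # candidate column x_ij on the remaining columns of X_jS
    others = np.delete(X_jS, j_pos, axis=1)
    beta = np.linalg.lstsq(others, X_jS[:, j_pos], rcond=None)[0]
    eta = np.insert(-beta, j_pos, 1.0)
    tau = np.abs(eta) @ np.sqrt(np.diag(Psi)) / np.sqrt(eta @ Psi @ eta)
    threshold = c_tau * tau * NormalDist().inv_cdf(1 - alpha / p)
    return W, threshold, W > threshold
```

Since ΨεjS is positive semidefinite, τjS ≥ 1 always, so this threshold is never smaller than the plain Bonferroni threshold of Definition 2 below.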
Condition 4 (Regularity for Data with Heteroskedasticity). Consider a sequence of data sets
Dn = (xi, yi)ni=1 ∼ P = Pn. The observations (xi, yi) are i.n.i.d. across i and yi = x′iθ0 + εi for
some θ0 with s0 = o(n). The residuals decompose into εi = εoi + εai such that a.s., E[εoi|xi] = 0
and maxi |εai| = O(n−1/2). In addition, a.s., uniformly in i and n, E[ε4i|xi] is bounded above
and E[ε2i|xi] is bounded away from zero. The covariates satisfy maxj≤p En[x12ij] = O(1) with
probability 1 − o(1). There is a sequence Kn, where s0 = o(Kn), and bounds ϕmin(Kn)(G)−1 =
O(1), ϕmax(Kn)(G) = O(1), ϕmin(Kn)(En[(εixi)(εixi)′])−1 = O(1), and max|S|<Kn, j∉S ‖ηjS‖1 =
O(1), which hold with probability 1 − o(1). The rate condition K4n log3 p/n = o(1) holds.
Condition 4, as before, gives conditions on the sparse eigenvalues, this time applying to both
G and to En[(εixi)(εixi)′]. In addition, Condition 4 assumes a bound on ηjS that may be strong
in some cases. Previous results in Tropp (2004), Zhang (2009) assume the strict condition that
maxj∉S0 ‖ηjS0‖1 < 1, which is the genuine irrepresentability condition, in analysis of inductive
variable selection algorithms. Here, the requirement < 1 is replaced by the weaker requirement
= O(1). Other authors, for instance Meinshausen and Buhlmann (2006), use conditions analogous
to max|S|≤Kn, j∉S ‖ηjS‖1 = O(1) in the context of learning high-dimensional graphs, and note
that the relaxed requirement is satisfied by a much broader class of data-generating processes.
Analogous bounds on ‖ηjS‖1 were not required in Theorem 1, because its proof does not leverage
bounds relating WjS to the self-normalized sums En[xij εi]/√En[x2ij ε2i], j ≤ p. Failure of the O(1)
6There is an unfortunate misprint in a Papers and Proceedings version of this paper, Kozbur (2017b), in which
the exponent 1/2 is missing from the term Diag(ΨεjS).
bound would lead only to slightly slower convergence rates. Condition 4 also states regularity
conditions on moments of εi and xi, which are useful for proving laws of large numbers, central
limit theorems, and moderate deviation bounds (see Jing et al. (2003)). Finally, the rate condition
controls the relative sizes of s0, p, and n, because s0 < Kn.
Theorem 5. Consider a sequence of data sets Dn ∼ P = Pn which satisfies Condition 4.
Suppose that cτ > 1 is fixed independent of n, and that α = o(1) with nα → ∞.7 Let θ be the
estimate obtained from Algorithm 2 with tests defined by Definition 1. Then there are bounds

En[(x′iθ0 − x′iθ)2]1/2 = O(√(s0 log(p/α)/n))  and  s = O(s0),

which hold with probability at least 1 − α − o(1) as n → ∞.
4.2. Additional TBFMS Formulations and Variants
In general, the quality of statistical performance of a variant of TBFMS may depend on the
structure of the data at hand through the size, power, and continuity properties of the tests, as
articulated in Condition 2. Theorem 4 is general and can thus potentially be applied for many
different types of TBFMS procedures, depending on how WjS and TjSα are defined in a particular
setting. Depending on the setting, some variants may have better size, power, and continuity
properties than others. This section describes several variants of the TBFMS procedure defined
in the previous section.
The definition of the first variant considered is based on the observation that there is a much
simpler formulation for the hypothesis tests that ignores the cτ τjS terms. This results in the
following definition.
Definition 2 (Simplified Hypothesis Tests for Heteroskedastic Disturbances). Let α > 0 be a
parameter. Assign

TjSα = 1 ⇐⇒ WjS > Φ−1(1 − α/p).
These tests are based on a simple Bonferroni-type correction. Furthermore, though never
previously formally justified, TBFMS using the simpler tests is natively programmed in some
statistical software, including SPSS and Stata. It is unknown to the author at the time of this
writing whether the same convergence rates can be attained using the simpler tests under the
identical regularity as in Condition 4. This option is explored in some finite sample settings in
the simulation study that follows. Evidence from the simulation study suggests that this option
performs better than the more complex tests defined in Definition 1. The tests in Definition 2
are not necessarily more conservative than those in Definition 1.
The next variant of the TBFMS procedure for heteroskedastic data illustrates an important aspect
of the result of Theorem 4. Namely, Theorem 4 explicitly allows the researcher to easily dissociate
7Allowing α to be fixed is possible under more restrictive conditions on the approximation error terms εai . If
p > n, then the rate log(p/α) becomes equivalent to simply log(p).
the regularization component of high-dimensional estimation from the fitting component. For
instance, the following formulation may be described as slightly greedier than Definition 2.
Definition 3 (Fit-Streamlined Hypothesis Tests for Heteroskedastic Disturbances). Let α > 0
be a parameter. Assign WjS = ∆jℓ(S). Assign

TjSα = 1 ⇐⇒ WhetjS > Φ−1(1 − α/p).
Under the more conservative tests, TjSα = 1 ⇐⇒ WhetjS > cτ τjS Φ−1(1 − α/p) for some
cτ > 1, the same convergence rates for this greedier TBFMS procedure under Condition 4 are
proven in the same way as Theorem 5 (by showing that the ratios ∆jℓ(S)/WhetjS are sufficiently
well behaved). For brevity, the details are omitted.
When the data is approximately homoskedastic, the tests defined in Definition 1 may be too
conservative and suffer in terms of power (noting that power is an explicit input into the bounds
in Theorem 4). In this case, tests of the following form can be considered using the homoskedastic-
based test statistics WhomjS = ([VhomjS]jj)−1/2 |[θjS]j| with VhomjS = (1/n) En[ε2ijS] En[xijS x′ijS]−1.
Definition 4 (Simplified Hypothesis Tests for Homoskedastic Disturbances). Let α > 0 be a
parameter. Assign WjS = WhomjS. Assign

TjSα = 1 ⇐⇒ WjS > Φ−1(1 − α/p).
Evidence from the simulation study suggests that this option performs better than the more
complex tests in Definition 1 when the data is homoskedastic, but not when heteroskedastic.
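For contrast with WhetjS, the homoskedastic-based statistic replaces the sandwich variance with the classical OLS variance (an illustrative sketch; the helper name is hypothetical).

```python
import numpy as np

def w_hom(X_jS, y, j_pos=0):
    """Sketch of W^hom_{jS}: same regression as W^het_{jS}, but with the
    classical variance V^hom = (1/n) En[eps^2] En[x x']^{-1}."""
    n = X_jS.shape[0]
    G = X_jS.T @ X_jS / n
    theta = np.linalg.solve(G, X_jS.T @ y / n)
    eps = y - X_jS @ theta
    V = np.mean(eps ** 2) * np.linalg.inv(G) / n   # classical OLS variance
    return abs(theta[j_pos]) / np.sqrt(V[j_pos, j_pos])
```

Under homoskedasticity this variance estimate is less noisy than the sandwich form, which is the source of the power gain noted above.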
4.3. TBFMS with Baseline Covariates
The initialization statement of Algorithm 2 can be modified trivially so that a baseline set Sbase of
covariates are automatically included in S at the start of selection. In this case, the initialization
statement of Algorithm 2 is replaced with
Initialize. Set S = Sbase.
Under this modification, a direct analogue of Theorem 4 holds. It is proven using the same
arguments. Sparsity bounds can be calculated as in the proof of Theorem 4 by separately tracking
covariates S0\Sbase. This requires an appropriate adjustment to Condition 3 in which Ktest must
be larger, and in particular, bound a quantity depending on both s0 and |Sbase|. The proof of
the first and third statements of Theorem 4 does not depend on the initialization S = ∅. Such a
modification is appropriate, for instance, in cases in which a researcher wishes to include a set of
covariates into a model, but is unsure of which interactions to include. In this case, TBFMS can
be used to help identify relevant interaction terms. This case is further explored in the empirical
application. Similarly, a constant term may be included automatically in the regression model.
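The resulting selection loop can be sketched generically (a schematic, not the paper's Algorithm 2 verbatim; the `test` oracle is a hypothetical interface, and adding the rejected candidate with the largest statistic is one natural greedy rule).

```python
def tbfms(p, test, S_base=frozenset(), K_max=50):
    """Greedy TBFMS loop: starting from the baseline set, repeatedly add a
    covariate among those whose test rejects, and stop when no remaining
    test rejects (or after K_max additions). `test(j, S)` returns
    (W_jS, reject_flag) for candidate j given the current selected set S."""
    S = set(S_base)
    for _ in range(K_max):
        candidates = [(test(j, S), j) for j in range(p) if j not in S]
        rejected = [(W, j) for (W, rej), j in candidates if rej]
        if not rejected:
            break                   # no test rejects: stop selecting
        _, j_best = max(rejected)   # add the strongest rejection
        S.add(j_best)
    return S
```

With `S_base` nonempty this is exactly the modification described above: baseline covariates are in the model from the start and selection proceeds over the remainder.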
4.4. Additional Discussion of Potential Variants
Analogous results potentially hold for dependent data using HAC-type standard errors (see
Newey and West (1987), Andrews (1991)). In addition, cluster-type standard errors for large-T -
large-n and fixed-T -large-n panels can be used by adapting arguments from Belloni et al. (2016).
Analogous results for homoskedastic disturbances can be derived as a corollary.
Another alternative is to consider generalized error rates. The conditions set forth in Condition
2 require control of a notion resembling family-wise error rate uniformly over hypothesis tests
H0 : ∆jE(S) = 0 for j ≤ p and |S| < Ktest for some integer Ktest. Other types of error rates
like k-family-wise error rate, false discovery rate, or false discovery proportion are potentially
possible as well. In particular, the arguments in the proof of Theorem 4 would continue to be
compatible with procedures that controlled an appropriate notion of false discovery proportion.
In order to keep exposition concise, these extensions are not considered here.
5. Example: TBFMS for Asset-Based Poverty Mapping
This section investigates the use of TBFMS to develop improved proxy-means tests in an appli-
cation to poverty mapping in a Peruvian dataset covering years 2010–2011. This analysis extends
the original analysis in Hanna and Olken (2018), who estimated a predictive model of household
consumption using the same data. The data is from the Peruvian Encuesta Nacional de Hogares
(ENAHO), maintained by the Instituto Nacional de Estadistica e Informatica (INEI), Peru.
Fighting poverty is a major priority for many developing countries and international organi-
zations like the United Nations. Strategies for combating poverty often require that governing
bodies have accurate information about household level consumption, income, or other measures
of welfare. Methods for empirical identification of households and regions below a given poverty
line are considered in, e.g., Elbers et al. (2003), among others.
One method in which a government can obtain a signal about measures of welfare is termed a
proxy-means test. The implementation of a proxy-means test is usually based on large censuses
of the population, in which government enumerators obtain information on easily observable and
verifiable assets. The government uses these assets to predict incomes or per-capita consumption
or other measures of poverty or welfare by estimating a regression on a smaller sample with
detailed measurement of consumption. The proxy-means score is defined as the predicted income
or consumption, which is calculated using the results from the predictive regression. This method
is widespread, and is implemented in several countries including Indonesia, Pakistan, Nigeria,
Mexico, the Philippines, Burkina Faso, Ecuador, and Jamaica. Improved predictions may be helpful to
policy makers in deciding on strategies to eliminate poverty; see Fiszbein et al. (2009).
The data contain covariates which are indicator variables describing household-level assets
and which are used to predict outcomes y = Consumption (in 10³ Peruvian Soles) as well as y =
log Consumption (in Peruvian Soles). The set of 46305 household observations is split randomly
(with equal probability) into a training sample of size 22674 and a testing sample of size 22704.
All estimation procedures considered are implemented on the training sample. The indicators
derive from factor variables describing a household’s (1) water source, (2) drain infrastructure,
(3) wall material, (4) roof material, (5) floor material, (6) availability of electricity, (7) access
to telephone, (8) education of head of household, (9) type of insurance, (10) crowding, (11)
consumption of luxury items. See Hanna and Olken (2018) for more details.
Here, TBFMS is used to determine which interactions of underlying covariates described above
are helpful in developing an improved proxy-means test.8 For every (unordered) pair of indicator
variables A and B, three symmetric logic functions,

and(A,B), or(A,B), xor(A,B),
are generated. Together with a constant term, these logic functions linearly span exactly the set
of symmetric boolean functions on all pairs of indicators. In addition, for every unordered triple
of indicator variables A, B, and C, the function
and(A,B,C)
is generated. The final interaction expansion is based on the union of the generated functions.
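The expansion can be sketched as follows (an illustrative numpy sketch; the function name is hypothetical, and the filtering and residualization steps described below are omitted).

```python
import numpy as np
from itertools import combinations

def logic_expansion(X):
    """Generate the pairwise and(A,B), or(A,B), xor(A,B) terms and the
    triple-wise and(A,B,C) terms from a 0/1 indicator matrix X (n x k)."""
    cols = []
    for a, b in combinations(range(X.shape[1]), 2):
        A, B = X[:, a], X[:, b]
        cols += [A * B, np.maximum(A, B), (A + B) % 2]   # and, or, xor
    for a, b, c in combinations(range(X.shape[1]), 3):
        cols.append(X[:, a] * X[:, b] * X[:, c])         # and(A,B,C)
    return np.column_stack(cols)
```

For k underlying indicators this yields 3·C(k,2) pairwise terms plus C(k,3) triple terms, which for the 62 baseline indicators here is on the order of the tens of thousands of interactions reported below.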
Proxy-means tests are estimated with ordinary least squares (OLS) using un-interacted indi-
cator variables (replicating Hanna and Olken (2018)) and with four TBFMS estimators adding
interactions. The TBFMS estimators are TBFMS I–IV and use Algorithm 2 together with tests
of Definitions 1–4, with selection initialized, as described in Section 4.3, with baseline variables
Sbase consisting of original (un-interacted) indicators and a constant term. Tuning parameters
are set to cτ = 1.01, α = .05 in all cases. Including baseline indicators is natural because they are
ex-ante researcher chosen as relevant and because they are few in number relative to the sample
size. Ten baseline indicators of the 72 contained in the original data were excluded due to
multicollinearity (after having generated interactions), leaving 62 baseline indicators (which includes
a constant term). Generated interaction terms with training sample standard deviation < 0.03
(corresponding to 20 or fewer 1s) were discarded. Interactions are residualized away from baseline
indicators on the training sample.9 Residualizing only affects the terms τjS in TBFMS I, with
τjS typically becoming smaller, thus resulting in more selections into the model. Residualizing
does not affect the estimates TBFMS II–IV. The final number of generated interactions is
23964 which exceeds the training sample size of 22674.
Table 1 reports test sample mean square error (MSE), training sample MSE, relative and abso-
lute reduction in test sample MSE (compared to OLS), p-values and confidence sets for absolute
reduction in MSE based on two-sided paired t-tests. Table 1 also reports summary statistics.
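The ∆MSE comparisons can be sketched as a paired test on per-observation squared errors (an illustrative sketch; the helper name is hypothetical, and with test samples of this size the normal approximation to the t distribution is used).

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

def paired_mse_test(err_a, err_b):
    """Paired comparison of two predictors' test-sample errors: mean MSE
    difference, two-sided p-value, and a 95% confidence interval."""
    d = err_a ** 2 - err_b ** 2               # per-observation MSE difference
    se = d.std(ddof=1) / sqrt(d.size)
    t = d.mean() / se
    pval = 2 * (1 - NormalDist().cdf(abs(t)))
    half = 1.959964 * se                      # z_{0.975} half-width
    return d.mean(), pval, (d.mean() - half, d.mean() + half)
```

A positive mean difference with a confidence interval excluding zero corresponds to a significant MSE reduction of predictor b over predictor a.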
8Hanna and Olken (2018) also note that other estimation techniques may offer improvement. The reference
Nichols and McBride (2018) also investigates various cross validation and machine learning techniques toward
this end.9Residualizing does not involve the outcome variable and can be viewed similarly to the process of demeaning
covariates provided the number of baseline indicators is sufficiently small. Residualizing introduces a small amount
of dependence across observations as does demeaning. The analysis in the proof of Theorem 5 can be adjusted to
show that dependence arising from residualization away from a fixed, suitably small set of covariates has negligible
effects.
Table 1
Asset-Based Poverty Estimation

A. y = Consumption (×10³ Peruvian Soles)

                 #Var   Train MSE   Test MSE   Rel(%) ∆MSE   ∆MSE     p-val ∆MSE   95% CI ∆MSE
OLS               62    0.0780      0.0897     –             –        –            –
TBFMS I          116    0.0727      0.0867     3.3503        0.0030   2.3E-08      (0.0020, 0.0041)
TBFMS II         159    0.0697      0.0860     4.1264        0.0037   0.0002       (0.0018, 0.0056)
TBFMS III        118    0.0654      0.0848     5.4865        0.0049   0.0002       (0.0023, 0.0076)
TBFMS IV         118    0.0654      0.0848     5.4865        0.0049   0.0002       (0.0023, 0.0076)

B. y = log Consumption (Peruvian Soles)

                 #Var   Train MSE   Test MSE   Rel(%) ∆MSE   ∆MSE     p-val ∆MSE   95% CI ∆MSE
OLS               62    0.1916      0.1910     –             –        –            –
TBFMS I          117    0.1849      0.1884     1.3954        0.0027   2.1E-07      (0.0017, 0.0037)
TBFMS II         140    0.1829      0.1892     0.9669        0.0018   0.0109       (0.0004, 0.0033)
TBFMS III         89    0.1818      0.1865     2.3809        0.0045   1.8E-10      (0.0031, 0.0060)
TBFMS IV          89    0.1818      0.1865     2.3809        0.0045   1.8E-10      (0.0031, 0.0060)

Data description: Training sample size: 22674, Test sample size: 22704. # Baseline characteristics: 62, # Interacted characteristics: 23964. Years collected: 2010–2011. Source: Instituto Nacional de Estadistica e Informatica (INEI), Encuesta Nacional de Hogares (ENAHO); also used in Hanna and Olken (2018).

Outcome distribution summary (unconditional):

                                          Q05      Q50      Q95      Mean     Stdev    Var
y = Consumption (×10³ Peruvian Soles)     0.1015   0.3550   1.1582   0.4608   0.4136   0.1711
y = log Consumption (Peruvian Soles)      4.6254   5.8766   7.0516   5.8566   0.7442   0.5539

Asset-based proxy-means estimation results. Estimates are presented for TBFMS I–TBFMS IV and OLS as described in the text. The OLS estimator replicates analysis in Hanna and Olken (2018).
Each TBFMS estimate indicates significant improvement in test sample MSE relative to OLS
at the 5% level. The highest relative improvements are with TBFMS III and TBFMS IV, which
selected the same models both for y = log Consumption and y = Consumption. For y = Consumption,
in the test sample, TBFMS III and IV give a 5.4865% relative improvement over OLS
in MSE; 0.0049 absolute improvement, p-value = .0002 based on 2-sided paired t-test, 95% CI
(0.0023, 0.0076). For log Consumption, TBFMS III, IV give a 2.3809% relative improvement
to OLS in MSE, absolute 0.0045 improvement; p-value = 1.8E-10 based on 2-sided paired t-test,
95% CI (0.0031, 0.0060).
6. TBFMS Simulation Studies
The results in the previous sections suggest that estimation with TBFMS should produce quality
estimates in large sample sizes. This section conducts two simulation studies to evaluate the
finite sample performance of TBFMS relative to select other procedures commonly used in high-
dimensional regression problems.
The simulation study compares the following estimators.
1. TBFMS I. Algorithm 2 with tests defined in Definition 1 with cτ = 1.01, α = .05.
2. TBFMS II. Algorithm 2 with simplified tests defined in Definition 2 with α = .05.
3. TBFMS III. Algorithm 2 with streamlined tests defined in Definition 3 with α = .05.
4. TBFMS IV. Algorithm 2 with homoskedastic tests defined in Definition 4 with α = .05.
5. Lasso-CV. Standard Lasso, with penalty parameter chosen using 10-fold cross validation.
θ is the minimizer of the Lasso objective function and is not refit on the selected model.
6. Post-Het-Lasso. Lasso implementation in Belloni et al. (2012), which is designed specifically
to handle heteroskedastic disturbances. This requires two tuning parameters, cτ and α, which are
set to cτ = 1.01 and α = .05. The estimate θ is refit on the selected model.
7. Oracle. Least squares regression on the model consisting of S0 = {j : [θ0]j ≠ 0}.
6.1. Simulation I
The first simulation study, Simulation I, evaluates TBFMS relative to several alternative estima-
tors on a series of performance metrics for high-dimensional sparse linear regression problems.
Simulation I considers data of the form Dn = (yi, xi)ni=1 with yi = x′iθ0 + εi. Samples Dn
are drawn from several data-generating processes reported in Table 2. Each considered data
generating process is characterized by parameters s0, ρ0, b0, n. The parameter s0 is the sparsity.
The parameter b0 controls the nature of the coefficient vector by θ0j = b0^{j−1} 1{j≤s0}. When b0 < 0,
the coefficients θ0j alternate sign in j and when |b0| < 1, the coefficients decay. The parameter
ρ0 controls the presence of heteroskedasticity in the disturbance terms εi. The terms εi are ho-
moskedastic when ρ0 = 0 and heteroskedastic otherwise. The dimensionality is always taken to
be double the sample size so that p = 2n. Each simulation design is replicated 1000 times.
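A draw from the Simulation I design can be sketched as follows (an illustrative sketch; the function name, seeding, and the AR(1) construction of the 0.5^{|j−k|} correlation are implementation choices not specified in the text).

```python
import numpy as np

def simulate_dgp1(n, s0=6, b0=0.5, rho0=0.5, seed=0):
    """One sample from the Simulation I design of Table 2: p = 2n Gaussian
    covariates with corr(x_ij, x_ik) = 0.5^|j-k|, coefficients b0^(j-1) on
    the first s0 covariates, heteroskedasticity controlled by rho0."""
    rng = np.random.default_rng(seed)
    p = 2 * n
    # AR(1) draw gives the 0.5^|j-k| correlation with unit variances
    z = rng.normal(size=(n, p))
    X = np.empty((n, p))
    X[:, 0] = z[:, 0]
    for j in range(1, p):
        X[:, j] = 0.5 * X[:, j - 1] + np.sqrt(0.75) * z[:, j]
    theta0 = np.zeros(p)
    theta0[:s0] = b0 ** np.arange(s0)
    # sigma_i = exp(rho0 * sum_j 0.75^(p-j) x_ij); homoskedastic if rho0 = 0
    sigma = np.exp(rho0 * (0.75 ** (p - 1 - np.arange(p)) * X).sum(axis=1))
    y = X @ theta0 + sigma * rng.normal(size=n)
    return X, y, theta0
```

Setting b0 = −0.5 gives the alternating-sign designs of Panels A and B, and rho0 = 0 gives the homoskedastic panels.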
The simulation results for Simulation I are reported in Table 3. The results track various
measures of estimation quality for the five estimators for fixed s0, b0, and ρ0, and for n. Columns
in Table 3 display (1) MPEN—mean prediction error norm defined as En[(x′iθ0 − x′iθ)2]1/2, (2)
RMSE—root mean square estimation error defined as ‖θ − θ0‖2, (3) MNCS—mean number of
correctly selected covariates from S0, (4) MSSS—mean size of selected set of covariates, in all
cases averaged over simulation replications.
Table 3 indicates that predictive and estimation performance improve for all estimators as n
increases from 100 to 500. Lasso-CV selects an increasing number of variables. Outside the Oracle
estimator, for all n, TBFMS II attains the best predictive and estimation performance in set-
tings with heteroskedasticity. In Table 3, there is no single feasible estimator that dominates in
every setting in terms of estimation error or prediction error. However, in all settings, Lasso-CV
selects the most covariates (both in absolute terms and in terms of the number of correctly
identified covariates), followed by TBFMS I–IV and Post-Het-Lasso. With alternating coefficients
(b0 = −0.5), TBFMS I and TBFMS II dominate Lasso-CV and Post-Het-Lasso on prediction
error and estimation error. This suggests that the performance of these estimators depends on
the configuration of the signal θ0 relative to the correlation structure of the covariates. Finally,
the relative difference in performance across estimators is larger in the presence of heteroskedasticity.
In the presence of heteroskedasticity, the Post-Het-Lasso exhibits faster improvement in
estimation error and prediction error with increasing n, though it is still dominated by the other
estimators. Note that each of the techniques, TBFMS I and Post-Het-Lasso, are theoretically
valid for sequences of data-generating processes with heteroskedasticity. In addition, the prop-
erties of cross-validation with Lasso are only beginning to be understood (see Chetverikov et al.
(2016) for analysis of Lasso with cross-validation). But this simulation study shows that
Lasso-CV leads to the selection of substantially more covariates.
6.2. Simulation II
This section performs a second simulation study, Simulation II, on the use of TBFMS in selecting
control covariates for the estimation of the effect of a covariate of interest on an outcome from
a large set of potential observable controls.
Simulation II considers data of the form Dn = (yi, xi, wi)ni=1 ∼ Pn where yi ∈ R are
outcome variables, xi ∈ R are variables of interest, and wi ∈ Rp are controls. In particular,

yi = xiβ0 + w′iθ10 + εi
xi = w′iθ20 + ui

for some parameters β0 ∈ R, θ10, θ20 ∈ Rp with E[εi|wi, xi] = 0 and E[ui|wi] = 0. Here, the impact
of the policy/treatment variable xi on the outcome yi is measured by the unknown parameter
β0, which is the target of inference. The wi are potentially important conditioning variables. The
confounding factors wi affect xi via the function w′iθ20 and the outcome variable via the function
w′iθ10. Both of the parameters θ10 and θ20 are unknown. As in Simulation I, specific data-generating
processes used in Simulation II depend on parameters s0, b0, ρ0, n and are given by Table 2.
The structure of the TBFMS procedure applied to a specific linear regression problem ensures
that any coefficient in that problem, unless reliably distinguishable from zero, is estimated to
be exactly zero. This property complicates inference after model selection in sparse models that
may have a set of covariates with small but non-zero coefficients. The use of TBFMS for a
linear regression of yi on (xi, wi), possibly initializing S to include xi at the start of model
selection, may result in excluding important conditioning covariates, which may lead to non-
negligible omitted variables bias in parameters of interest. As a result, inference which does not
take account the possibility of such model selection mistakes can be distorted. This intuition
is formally developed in Leeb and Potscher (2008). Several recent papers offer solutions to this
problem; see, for example, Zhang and Zhang (2014), Belloni et al. (2014), van de Geer et al.
(2014); Javanmard and Montanari (2014).
To estimate β0 in this environment, adopt the post-double-selection method of Belloni et al.
(2014) in conjunction with TBFMS. This method proceeds by first substituting to obtain pre-
dictive relationships for the outcome yi and the treatment xi in terms of only control variables:
yi = w′iθRF0 + vi
xi = w′iθFS0 + ui

with θFS0 = θ20, referred to as the first stage (FS) coefficient, and θRF0 = θ10 + β0θ20, referred to
as the reduced form (RF) coefficient. TBFMS is applied to each of the above two equations to
as the reduced form (RF) coefficient. TBFMS is applied to each of the above two equations to
select one set of variables that are useful for predicting yi and another set of variables useful for
predicting xi. Once this is done, the union of the selected sets will index the final set of control
variables. Belloni et al. (2014) develop and discuss the post-double-selection method in detail.
Using two model selection steps guards against distorted inference and guarantees that variables
excluded in both model selection steps have a negligible contribution to omitted variables bias.
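The post-double-selection step can be sketched with a generic selector in place of TBFMS (an illustrative sketch of the Belloni et al. (2014) recipe; `select` and the function interface are hypothetical).

```python
import numpy as np

def post_double_selection(y, x, W, select):
    """Sketch of post-double-selection: run model selection once for the
    reduced form (y on W) and once for the first stage (x on W), then
    refit y on x plus the union of the selected controls.
    `select(target, W)` returns selected column indices of W."""
    S_rf = select(y, W)                 # controls predicting the outcome
    S_fs = select(x, W)                 # controls predicting the treatment
    S = sorted(set(S_rf) | set(S_fs))   # union of the two selected sets
    Z = np.column_stack([x, W[:, S], np.ones(len(y))])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    return beta[0]                      # coefficient on the treatment x
```

Any of the TBFMS variants or Lasso can play the role of `select`, which is exactly how the seven Simulation II estimators below differ.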
Analogously to Simulation I, seven estimators are considered for estimation and are
named TBFMS I–TBFMS IV, Lasso-CV, Post-Het-Lasso, and Oracle. The estimators differ only
in that they replace TBFMS I with a different model selection technique in selecting covariates
into the final estimated model as described above. Final variance estimates V for β are based
on HC3 standard errors (White (1980)). For each estimator and simulation setting, the bias and
standard deviation of the point estimates, and coverage probability and average interval length
are computed over 1000 simulation replications. Results are shown in Table 4.
The simulation results indicate that across the data-generating processes considered, TBFMS
estimators largely achieve near-Oracle coverage probabilities. TBFMS IV-based estimates exhibit
some size distortion in heteroskedastic settings. In all settings, bias, standard deviation, and
interval lengths of TBFMS I – III closely resemble the Oracle estimator. In some simulations
(notably in Panel A) there is a large difference in coverage probabilities between the TBFMS
estimates and the Post-Het-Lasso estimate. Though the Post-Het-Lasso-based confidence sets
are asymptotically uniformly valid, and are theoretically robust against model selection mistakes,
finite sample model selection properties remain important. Interestingly, in this case, using the
relaxed penalty level with Lasso-CV does not improve coverage probability.
Table 2
Design for Simulation Studies I and II

I. High-Dimensional Prediction
Data: Dn = (yi, xi)ni=1 iid
DGP: yi = x′iθ0 + εi
p = dim(xi) = 2n
θ0j = b0^{j−1} 1{j≤s0}
xij ∼ N(0, 1), corr(xij, xik) = 0.5^{|j−k|}
εi ∼ σiN(0, 1)
σi = exp(ρ0 Σpj=1 0.75^{p−j} xij)
Settings: s0 = 6, b0 ∈ {−0.5, 0.5}, ρ0 ∈ {0, 0.5}, n ∈ {100, 500}

II. High-Dimensional Controls
Data: Dn = (yi, xi, wi)ni=1 iid
DGP: yi = xiβ0 + w′iθ10 + εi, xi = w′iθ20 + ui
p = dim(wi) = 2n
θ10j = b0^{j−1} 1{j≤s0}, θ20j = sin(j) 1{j≤s0}
wij ∼ N(0, 1), corr(wij, wik) = 0.5^{|j−k|}
(εi, ui) ∼ σiN(0, I2)
σi = exp(ρ0 Σpj=1 0.75^{p−j} wij)
Settings: s0 = 6, b0 ∈ {−0.5, 0.5}, ρ0 ∈ {0, 0.5}, n ∈ {100, 500}
Table 3
Simulation I Results: Prediction in the Linear Model

                           n = 100                            n = 500
                  MPEN    RMSE    MNCS    MSSS      MPEN    RMSE    MNCS    MSSS

A. ρ0 = 0 : Homoskedastic, s0 = 6 : High Sparsity, b0 = −0.5 : Alternating Coefficients
TBFMS I           0.313   0.484   1.470    1.470    0.146   0.218   2.599    2.599
TBFMS II          0.192   0.281   2.330    2.444    0.094   0.132   3.403    3.478
TBFMS III         0.194   0.286   2.307    2.307    0.094   0.133   3.432    3.432
TBFMS IV          0.191   0.281   2.301    2.358    0.093   0.132   3.400    3.461
Post-Het-Lasso    0.417   0.640   0.958    0.958    0.401   0.617   1.000    1.000
Lasso-CV          0.285   0.468   2.850   21.463    0.173   0.275   3.693   40.617
Oracle            0.117   0.149   6.000    6.000    0.053   0.066   6.000    6.000

B. ρ0 = 0.5 : Heteroskedastic, s0 = 6 : High Sparsity, b0 = −0.5 : Alternating Coefficients
TBFMS I           0.507   0.720   0.815    0.815    0.395   0.593   1.077    1.077
TBFMS II          0.478   0.671   0.994    1.012    0.276   0.401   1.765    1.780
TBFMS III         0.526   0.734   0.824    0.824    0.294   0.429   1.650    1.650
TBFMS IV          0.529   0.726   0.896    0.973    0.299   0.425   1.713    1.807
Post-Het-Lasso    0.884   1.147   0.015    0.015    0.581   0.846   0.867    0.867
Lasso-CV          0.594   0.849   1.242    8.863    0.438   0.661   1.670   18.326
Oracle            0.379   0.482   6.000    6.000    0.180   0.222   6.000    6.000

C. ρ0 = 0 : Homoskedastic, s0 = 6 : High Sparsity, b0 = 0.5 : Positive Coefficients
TBFMS I           0.307   0.395   2.331    2.331    0.147   0.186   3.396    3.396
TBFMS II          0.193   0.244   3.094    3.193    0.091   0.113   4.078    4.152
TBFMS III         0.193   0.244   3.064    3.064    0.091   0.114   4.105    4.105
TBFMS IV          0.191   0.241   3.062    3.109    0.091   0.114   4.068    4.128
Post-Het-Lasso    0.782   0.583   1.174    1.174    0.615   0.468   2.121    2.121
Lasso-CV          0.207   0.204   4.570   15.392    0.109   0.099   5.257   20.100
Oracle            0.117   0.149   6.000    6.000    0.053   0.066   6.000    6.000

D. ρ0 = 0.5 : Heteroskedastic, s0 = 6 : High Sparsity, b0 = 0.5 : Positive Coefficients
TBFMS I           0.665   0.789   1.274    1.274    0.405   0.499   1.976    1.976
TBFMS II          0.513   0.617   1.759    1.780    0.285   0.346   2.580    2.590
TBFMS III         0.577   0.662   1.505    1.505    0.310   0.369   2.400    2.400
TBFMS IV          0.580   0.670   1.574    1.656    0.314   0.373   2.478    2.574
Post-Het-Lasso    1.314   1.014   0.288    0.288    0.879   0.663   1.898    1.898
Lasso-CV          0.570   0.522   3.125   11.661    0.334   0.289   3.873   16.337
Oracle            0.379   0.482   6.000    6.000    0.180   0.222   6.000    6.000

Simulation results for estimation in the design described in Table 2 with s0 = 6, b0 ∈ {−0.5, 0.5}, ρ0 ∈ {0, 0.5}, and n ∈ {100, 500}. Estimates are presented for the estimators TBFMS I–TBFMS IV, Post-Het-Lasso, Lasso-CV, and Oracle described in the text. Columns display (1) MPEN—mean prediction error norm defined as En[(x′iθ0 − x′iθ)2]1/2, (2) RMSE—root mean square estimation error defined as ‖θ − θ0‖2, (3) MNCS—mean number of correctly selected covariates from S0, (4) MSSS—mean size of selected set of covariates, in all cases averaged over simulation replications. Figures are based on 1000 simulation replications.
Table 4
Simulation II Results: Control Selection in the Linear Model

                          n = 100                              n = 500
                 Bias    StDev   Length  Cover       Bias    StDev   Length  Cover

A. ρ0 = 0 : Homoskedastic, s0 = 6 : High Sparsity, b0 = −0.5 : Alternating Coefficients
TBFMS I         -0.035   0.129   0.354   0.772      -0.036   0.060   0.182   0.797
TBFMS II        -0.009   0.114   0.428   0.934       0.000   0.047   0.178   0.941
TBFMS III       -0.016   0.112   0.424   0.928      -0.000   0.047   0.178   0.941
TBFMS IV        -0.011   0.112   0.425   0.929      -0.000   0.047   0.178   0.940
Post-Het-Lasso  -0.190   0.057   0.210   0.068      -0.193   0.025   0.094   0.000
Lasso CV        -0.193   0.054   0.214   0.067      -0.193   0.025   0.094   0.000
Oracle           0.002   0.105   0.424   0.956      -0.001   0.047   0.178   0.941

B. ρ0 = 0.5 : Heteroskedastic, s0 = 6 : High Sparsity, b0 = −0.5 : Alternating Coefficients
TBFMS I          0.032   0.326   1.016   0.881       0.003   0.219   0.752   0.927
TBFMS II         0.010   0.373   1.179   0.901       0.005   0.231   0.806   0.929
TBFMS III        0.010   0.366   1.141   0.897       0.004   0.231   0.801   0.928
TBFMS IV         0.009   0.372   1.167   0.899       0.005   0.231   0.803   0.929
Post-Het-Lasso  -0.111   0.183   0.553   0.721      -0.104   0.129   0.421   0.650
Lasso CV        -0.116   0.209   0.637   0.745      -0.104   0.130   0.422   0.651
Oracle           0.004   0.406   1.333   0.909       0.009   0.234   0.820   0.934

C. ρ0 = 0 : Homoskedastic, s0 = 6 : High Sparsity, b0 = 0.5 : Positive Coefficients
TBFMS I          0.012   0.122   0.362   0.814      -0.054   0.059   0.184   0.729
TBFMS II        -0.012   0.113   0.426   0.929       0.000   0.044   0.179   0.960
TBFMS III       -0.019   0.112   0.423   0.926      -0.000   0.044   0.179   0.961
TBFMS IV        -0.014   0.113   0.423   0.920       0.000   0.044   0.179   0.960
Post-Het-Lasso  -0.088   0.061   0.234   0.669      -0.085   0.027   0.102   0.109
Lasso CV        -0.086   0.059   0.234   0.680      -0.085   0.027   0.102   0.109
Oracle           0.002   0.102   0.420   0.954      -0.001   0.044   0.179   0.960

D. ρ0 = 0.5 : Heteroskedastic, s0 = 6 : High Sparsity, b0 = 0.5 : Positive Coefficients
TBFMS I         -0.032   0.334   1.020   0.857       0.009   0.250   0.763   0.918
TBFMS II        -0.016   0.371   1.196   0.882      -0.001   0.259   0.811   0.923
TBFMS III       -0.010   0.362   1.153   0.884      -0.001   0.257   0.807   0.920
TBFMS IV        -0.007   0.363   1.176   0.885       0.000   0.259   0.809   0.920
Post-Het-Lasso  -0.161   0.216   0.596   0.638      -0.044   0.154   0.427   0.859
Lasso CV        -0.063   0.203   0.641   0.903      -0.044   0.154   0.427   0.858
Oracle          -0.019   0.397   1.352   0.895       0.003   0.263   0.826   0.924
Simulation results for estimation in the design described in Table 2 with s0 = 6, b0 ∈ {−0.5, 0.5}, ρ0 ∈ {0, 0.5}, and n ∈ {100, 500}. Estimates are presented for the estimators TBFMS I–TBFMS IV, Post-Het-Lasso, Lasso-CV, and Oracle described in the text. Columns display (1) Bias—bias of the respective estimates for β0, (2) StDev—standard deviation of the respective estimates for β0, (3) Length—length of confidence intervals for β0, (4) Cover—coverage probabilities of the respective 95% confidence intervals for β0. Figures are based on 1000 simulation replications.
7. Conclusion
This paper has considered TBFMS for high-dimensional sparse linear regression problems. The
procedure is shown to achieve estimation rates matching those of Lasso and Post-Lasso under a
broad class of data-generating processes.
Appendix A: Proofs
This appendix proves Theorems 1 and 4. As TBFMS is a greedy procedure whose output is not the solution of a simple optimization problem, the proofs establishing the properties of TBFMS cannot refer to any global optimality conditions. This fact limits the applicability of common m-estimation arguments, or of the arguments used to obtain similar bounds for Lasso, and requires the development of certain different techniques.
In the course of the proofs, several important intermediate results are recorded as lemmas.
Lemmas which do not follow immediately from arguments in this section are proven in the online
supplemental material, Supplement to “Analysis of Testing-Based Forward Model Selection.”
In addition, Theorems 2, 3 and 5 are proven in the online supplemental material.
A.1. Proof of the First Statement of Theorem 1
The first statement of Theorem 1 is proven by creating an appropriate analogue of the basic inequality¹⁰ from standard Lasso analysis. Specifically, the following lemma holds.

Lemma 1. ℓ(S ∪ S0) ≤ ℓ(θ0).

Lemma 1 holds by ℓ(S ∪ S0) ≤ ℓ(S0) ≤ ℓ(θ0). Once the analogue basic inequality is noted, ℓ(θ) can be related to ℓ(θ0) with a bound that depends on s0, t, and ϕmin(s + s0)(G)−1. The following is Lemma 3.3 in Das and Kempe (2011).

Lemma 2 (Das and Kempe (2011)). ℓ(S) − ℓ(S ∪ S0) ≤ ϕmin(s + s0)(G)−1 ∑j∈S0\S(−∆jℓ(S)).

Using the fact that t is the threshold in Algorithm 1, and thus −∆jℓ(S) ≤ t, gives the further bound ℓ(S) − ℓ(S ∪ S0) ≤ s0 t ϕmin(s + s0)(G)−1. Applying the basic inequality along with the fact that ℓ(S) = ℓ(θ) gives ℓ(θ) ≤ ℓ(θ0) + s0 t ϕmin(s + s0)(G)−1. Expanding the quadratics ℓ(θ) and ℓ(θ0), and applying arguments analogous to those in Lemma 6 in Belloni et al. (2012), gives

Lemma 3. En[(x′iθ0 − x′iθ)2]1/2 ≤ cF(s).
Lemma 3 is the first statement of Theorem 1.
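To fix ideas, the thresholding logic invoked above — at termination, no remaining covariate decreases the loss by more than t — can be sketched numerically. The following is a minimal Python sketch of plain thresholded forward selection; it uses the raw loss decrease −∆jℓ(S) directly as the selection criterion rather than the profile of hypothesis tests that defines TBFMS, and the data, dimensions, and threshold value are hypothetical.

```python
import numpy as np

def forward_select(X, y, t):
    """Greedy forward selection: repeatedly add the covariate whose inclusion
    most decreases the least-squares loss, stopping once the best available
    decrease is at most t (the analogue of -Delta_j ell(S) <= t)."""
    n, p = X.shape

    def loss(S):
        # ell(S): mean squared residual after regressing y on the columns in S.
        if not S:
            return float(np.mean(y ** 2))
        coef = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
        return float(np.mean((y - X[:, S] @ coef) ** 2))

    S = []
    while len(S) < p:
        current = loss(S)
        gains = {j: current - loss(S + [j]) for j in range(p) if j not in S}
        j_star = max(gains, key=gains.get)
        if gains[j_star] <= t:
            break
        S.append(j_star)
    return S

rng = np.random.default_rng(0)
n, p, s0 = 200, 20, 3
X = rng.standard_normal((n, p))
theta0 = np.zeros(p)
theta0[:s0] = 1.0                      # here S0 = {0, 1, 2}
y = X @ theta0 + 0.1 * rng.standard_normal(n)
S_hat = forward_select(X, y, t=0.01)
print(sorted(S_hat))
```

With a strong signal and a threshold above the noise-level loss decreases, the selected set coincides with S0; the bounds of Theorem 1 control prediction error and the number of selections when the separation is less clean.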
A.2. Proof of First and Third Statements of Theorem 4
Let T be the event implied by Condition 2. Then P(T) ≥ 1 − α − 3δtest/3 = 1 − α − δtest. The rest of the proof works on the event T. If Algorithm 2 terminates at Ktest steps or earlier, then it terminates at a step with −∆jE(S) ≤ ctest for every j ∉ S. Similarly to the proof of Theorem 1 above, Lemma 3.3 in Das and Kempe (2011) yields |E(S0) − E(S)| ≤ ϕmin(Ktest)(E[G])−1 ∑j∈S0\S(−∆jE(S)) ≤ s0 ctest ϕmin(Ktest)(E[G])−1. It is shown in the next section that s ≤ Ktest on T, completing the proof of the first statement of Theorem 4. Next,

Lemma 4. En[(x′iθ0 − x′iθ)2]1/2 ≤ c′′T(s).

Lemma 4 is the third statement of Theorem 4.

¹⁰For Lasso estimation with penalty level λ, the basic inequality asserts that ℓ(θ) + λ‖θ‖1 ≤ ℓ(θ0) + λ‖θ0‖1.
A.3. Proof of Sparsity Bounds for Theorems 1 and 4
The sparsity bounds in Theorems 1 and 4 are proven together. In the case of Theorem 1, the
covariates xj = (x1j , ..., xnj)′, outcome y = (y1, ..., yn)′ as well as disturbances ε = (ε1, ..., εn)′
are considered elements of the Hilbert space Rn with inner product 〈a, b〉 = n× En[aibi]. In the
case of Theorem 4, xj , y, ε are elements of the Hilbert space L2(Ω,Rn), of P-square-integrable
random vectors taking values in Rn with Ω an underlying probability space, with inner product
〈a, b〉L2(Ω,Rn) = n × E En[aibi]. The notation H is used to denote whichever of these Hilbert spaces is appropriate: H = Rn for Theorem 1 and H = L2(Ω,Rn) for Theorem 4. In
addition, let tH = t and GH = G in the case of Theorem 1 and let tH = c′test and GH = E[G] in
the case of Theorem 4. In the case of Theorem 4, the arguments that follow hold on T, the event
defined by Condition 2 (see also the previous subsection).
A.3.1. Two Orthogonalizations
Let vk ∈ H, k = 1, ..., s0 denote true covariates which refer to xj for j ∈ S0. The term false
covariates refers to those xj for which j /∈ S0. Consider the step after which there are exactly m
false covariates selected into the model. These are denoted w1, ..., wm, ordered according to the order they were selected, and indexed by the set A = {1, ..., m}.

Apply Gram-Schmidt orthogonalization to v1, ..., vs0, ε with the inner product from H. The
ordering is according to selection into S. Any true covariates unselected at m false covariate
selections are temporarily ordered arbitrarily after the selected true covariates. ε is placed last.
This yields a new set of unit-norm elements
v1, ..., vs0 , ε 7→ v1, ..., vs0 , ε ∈ H.
This orthogonalization also yields parameters

θ = (θ1, ..., θs0)′ ∈ Rs0, θε ∈ R satisfying y = v1θ1 + ... + vs0θs0 + θεε.
Reorder the unselected covariates vk such that for any unselected true covariate, θk ≥ θl whenever l > k. No additional orthogonalization is performed.
Apply a separate orthogonalization to w1, ..., wm ∈ H. These are orthogonalized by the Gram-
Schmidt process according to order of inclusion into S, with true covariates included (interspersed
according to when they were selected between the wj) in the orthogonalization process. The
resulting Gram-Schmidt-orthogonalized elements are renormalized to give
w1, ..., wm; v1, ..., vs0 7→ w1, ..., wm ∈ H
such that the component of each wj orthogonal to the span of v1, ..., vs0 in H has unit norm. This
renormalization is possible whenever ϕmin(m + s0)(GH) > 0. Therefore, wj can be decomposed
into wj = rj + uj which satisfy rj ∈ span(v1, ..., vs0) and uj ∈ span(v1, ..., vs0)⊥ and ‖uj‖H = 1.
For ease of reading, the online supplemental material also presents additional descriptive
notation and details about the orthogonalization constructions.
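The two orthogonalizations can also be illustrated numerically. The Python sketch below uses hypothetical data and, for simplicity, orthogonalizes the false covariates only against one another (the case in which no true covariates are interspersed); each orthogonalized false covariate is then rescaled so that its component orthogonal to the span of the orthonormalized true covariates has unit norm, giving the decomposition wj = rj + uj with ‖uj‖ = 1.

```python
import numpy as np

def gram_schmidt(cols):
    # Orthogonalize the vectors in order (inner product <a, b> = a.b).
    out = []
    for c in cols:
        c = c.astype(float).copy()
        for q in out:
            c -= (q @ c) / (q @ q) * q
        out.append(c)
    return out

rng = np.random.default_rng(1)
n, s0, m = 50, 3, 4
V = rng.standard_normal((n, s0))     # true covariates v_1, ..., v_{s0}
eps = rng.standard_normal(n)         # disturbance, placed last
W = rng.standard_normal((n, m))      # false covariates w_1, ..., w_m

# First orthogonalization: v_1, ..., v_{s0}, eps -> unit-norm orthogonal system.
sys1 = [q / np.linalg.norm(q) for q in gram_schmidt(list(V.T) + [eps])]
vbar = np.stack(sys1[:s0], axis=1)   # orthonormal versions of the true covariates

# Second orthogonalization: w_1, ..., w_m in selection order, rescaled so the
# component orthogonal to span(vbar) has unit norm.
wbar = []
for q in gram_schmidt(list(W.T)):
    u = q - vbar @ (vbar.T @ q)      # u_j: the part of w orthogonal to span(vbar)
    wbar.append(q / np.linalg.norm(u))

# Each rescaled wbar_j decomposes as r_j + u_j with r_j in span(vbar), ||u_j|| = 1.
norms_u = [np.linalg.norm(w - vbar @ (vbar.T @ w)) for w in wbar]
print(np.round(norms_u, 8))
```

In particular the rescaling forces ‖wj‖ ≥ 1, a fact used repeatedly in the sparsity arguments below.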
The two orthogonalizations are next related to each other by associating, for each j ∈ A,
parameters
j 7→ (γj ∈ Rs0 , γjε ∈ R)
which are defined as γjk = 〈wj, vk〉H and γjε = 〈wj, ε〉H. Assume without loss of generality that each component of θ is positive (the remainder of the proof does not depend on the sign assigned during orthogonalization). Similarly, assume without loss of generality that γ′jθ ≥ 0.
A large part of the following analysis is at the level of the parameters θ, θε, γj, γjε. Therefore, some remarks are helpful. Aside from relating the two orthogonalizations, these parameters also encode information about incremental loss for various j and sets S. For example, in the case of Theorem 1, it can be shown that −∆jℓ(S) = (1/n)(1/‖wj‖2H)(θ′γj + θεγjε)2 if S is the set of all covariates selected before wj. Similarly, −∆kℓ(S) = (1/n)θ2k if S corresponds to the set {v1, ..., vk−1}. Relatedly, note that if γ′jθ is large for sufficiently many j, then some dependence between the γj may be anticipated; see Tao (2014) for a general discussion of partial transitivity of correlation. This, however, heuristically creates tension with the fact that the γj arise from orthogonalized covariates.
A.3.2. Main Sparsity Bounds
Divide the set A of false covariates into two sets A1 and A2, with cardinalities m1 and m2, on the basis of the magnitude of γjε. Set

A1 = {j : γ2jε ≤ tH n/(3〈ε, Ms0ε〉)},  A2 = A \ A1,

where Ms0 denotes projection in H onto the space orthogonal to span(v1, ..., vs0). Note that large values of γjε indicate higher dependence between orthogonalized versions of wj and ε. Bounds on the size of A1 are given first.
Let A1k be the set of j ∈ A1 such that j is selected prior to the k-th true selection, but not prior to any earlier true selections. For j ∈ A1 and k, l ∈ S0, let C1 > 0 and 1 ≥ C2 > 0 be constants which satisfy

γ′jθ ≥ θkC1 for j ∈ A1k, and θk ≥ θlC2 for l > k.

The two key constants C1, C2 encode information about relative incremental loss values at various points of the forward selection procedure. Lemma 5 calculates suitable C1, C2.

Lemma 5. γ′jθ ≥ θk ((1/6) c′′test ϕmin(m + s0)(GH))1/2 for j ∈ A1k, where c′′test is taken equal to 1 in the case of Theorem 1. In addition, in the case of Theorem 1, θk ≥ θl ϕmin(m + s0)(GH)1/2 for l > k; in the case of Theorem 4, θk ≥ θl (c′′test ϕmin(m + s0)(GH))1/2 for l > k.
Define the following two s0 × s0 matrices: Γ, with entries Γkl = ∑j∈A1k γjl, and B, symmetric, with entries Bkl = θl/θk for l ≥ k. Empty sums are taken to be 0. In case θk = 0 for some k, θl/θk is defined to be 1. The above definitions of Γ, B are useful because the diagonal elements of the product ΓB satisfy the equality [ΓB]kk = ∑j∈A1k γ′jθ/θk. This follows from Γ being upper triangular (by the orthogonalization construction) and from the fact that A1k is empty if θk = 0 (see the remark in the proof of Lemma 5).

The definition of C1 implies that [ΓB]kk ≥ C1|A1k| and subsequently

tr(ΓB) ≥ C1m1.
The product ΓB has a convenient decomposition. Consider the set

Gs0 = {Z ∈ Rs0×s0 : Zkl = 〈Xk, Yl〉H1 for some elements Xk, Yl with ‖Xk‖H1, ‖Yl‖H1 ≤ 1 in some s0-dimensional real Hilbert space H1}.

Lemma 6. The matrix product ΓB may be expressed as ΓB = ΓC3Z where C3 is a constant which may be taken as C3 = (C2)−2 and where Z′ ∈ Gs0.
This decomposition is helpful because of the following result due to Grothendieck.

Lemma 7 (Grothendieck (1953)). supZ∈Gs0 tr(MZ) ≤ KRG ‖M‖∞→1.

Here, KRG is an absolute constant which is known to be less than 1.783. Importantly, it does not depend on s0. The notation ‖·‖∞→1 indicates the operator norm for bounded linear operators L∞ → L1. When the matrix M is s0 × s0 dimensional, the implied L∞, L1 spaces are L∞({1, ..., s0}), L1({1, ..., s0}), or equivalently (Rs0, ‖·‖∞), (Rs0, ‖·‖1). The form used here is that described in Guedon and Vershynin (2016), Equation 3.2. Therefore,

(C3)−1 C1 m1 ≤ (C3)−1 tr(ΓB) = (C3)−1 tr(ΓC3Z) = tr(Γ′Z′) ≤ KRG ‖Γ′‖∞→1.

In light of this lower bound on ‖Γ′‖∞→1, there exists ν ∈ {−1, 1}s0 such that

‖Γ′ν‖1 ≥ (KRG)−1 (C3)−1 C1 m1.
On the other hand, ‖Γ′ν‖1 may be upper bounded by a quantity that depends on s0 by

‖Γ′ν‖1 ≤ √s0 ‖Γ′ν‖2.
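The existence of a maximizing sign vector ν reflects a general fact: ‖M‖∞→1 equals the maximum of ‖Mν‖1 over ν ∈ {−1, 1}s0, since a convex function of x over the cube ‖x‖∞ ≤ 1 is maximized at a vertex. A brute-force Python check on a small matrix (an arbitrary stand-in for Γ′, not derived from the construction above):

```python
import itertools
import numpy as np

def norm_inf_to_1(M):
    # ||M||_{inf->1}: the maximum of ||M x||_1 over ||x||_inf <= 1 is
    # attained at a vertex of the cube, i.e., at a sign vector nu.
    s = M.shape[1]
    return max(float(np.abs(M @ np.array(nu)).sum())
               for nu in itertools.product([-1.0, 1.0], repeat=s))

rng = np.random.default_rng(2)
s0 = 4
M = rng.standard_normal((s0, s0))
val = norm_inf_to_1(M)

# Recover a maximizing sign vector nu, as in the display above.
nu_star = max((np.array(nu) for nu in itertools.product([-1.0, 1.0], repeat=s0)),
              key=lambda nu: np.abs(M @ nu).sum())
print(val)
```

The maximizer nu_star also satisfies the Cauchy-Schwarz bound ‖Mν‖1 ≤ √s0 ‖Mν‖2 used in the next step.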
A key property of the γj, which constitute Γ, is that they are approximately orthogonal to each other in the sense of the following lemma. In particular, signed sums of the γj scale in norm like √m1 up to a factor depending on ϕmin(m + s0)(GH).

Lemma 8. For any signs ej ∈ {−1, 1}, ‖∑j∈A1 ejγj‖2 ≤ √m1 ϕmin(m + s0)(GH)−1/2.
Therefore, ‖Γ′ν‖2 = ‖∑s0k=1 ∑j∈A1k νkγj‖2 ≤ √m1 ϕmin(m + s0)(GH)−1/2, which, when combined with the bound ‖Γ′ν‖1 ≤ √s0 ‖Γ′ν‖2, immediately implies the following.

Lemma 9. m1 ≤ ϕmin(m + s0)(GH)−1 (C1)−2 (C3)2 (KRG)2 s0.
Having controlled m1, it is left to give a bound which controls m2. The following lemma is proven by showing that the orthogonalization process wj ↦ wj cannot create too many variables j with large γjε, given that the relevant regularization condition is met.

Lemma 10. m2 ≤ 3(m1 + s0), provided t1/2 ≥ 2ϕmin(m + s0)(G)−1‖En[xiεi]‖∞ in the case of Theorem 1 and E[En[(εai)2]] ≤ (1/2)ϕmin(m + s0)(E[G])−1 c′test in the case of Theorem 4.

The final lemma restates the sparsity bounds of Theorems 1 and 4. Its proof involves only assembling the previous arguments. Recall that m1 + m2 = m and that m is the number of false selections being considered.

Lemma 11. In the case of Theorem 1, if t1/2 ≥ 2ϕmin(m + s0)(G)−1‖En[xiεi]‖∞ holds, then also m ≤ 80 × ϕmin(m + s0)(G)−4 s0 holds. In the case of Theorem 4, s ≤ (80 × ϕmin(m + s0)(E[G])−4 (c′′test)−3 + 1)s0.
This completes the proof of Theorems 1 and 4.
References
D. W. K. Andrews. Heteroskedasticity and autocorrelation consistent covariance matrix estima-
tion. Econometrica, 59(3):817–858, 1991.
J. Bai and S. Ng. Forecasting economic time series using targeted predictors. Journal of Econo-
metrics, 146:304–317, 2008.
A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse
models. Bernoulli, 19(2):521–547, 2013.
A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal
instruments with an application to eminent domain. Econometrica, 80:2369–2429, 2012.
A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment effects after selection
amongst high-dimensional controls. Review of Economic Studies, 81(2):608–650, 2014.
A. Belloni, V. Chernozhukov, C. Hansen, and D. Kozbur. Inference in high-dimensional panel
models with an application to gun control. Journal of Business & Economic Statistics, 34(4):
590–605, 2016.
D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens.
Annals of Statistics, 44(2):813–852, 04 2016.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector.
Annals of Statistics, 37(4):1705–1732, 2009.
P. Buhlmann. Boosting for high-dimensional linear models. Annals of Statistics, 34(2):559–583,
2006.
P. Buhlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer, 2011.
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than
n. Annals of Statistics, 35(6):2313–2351, 2007.
D. Chetverikov, Z. Liao, and V. Chernozhukov. On cross-validated Lasso. ArXiv e-prints, 2016.
A. Chudik, G. Kapetanios, and M. H. Pesaran. A one covariate at a time, multiple testing
approach to variable selection in high-dimensional linear regression models. Econometrica, 86
(4):1479–1512, 2018.
A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse
approximation and dictionary selection. In L. Getoor and T. Scheffer, editors, Proceedings of
the 28th International Conference on Machine Learning (ICML-11). ACM, 2011.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics,
32(2):407–451, 2004.
M. Efroymson. Stepwise regression–a backward and forward look. Florham Park, New Jersey,
1966.
C. Elbers, J. O. Lanjouw, and P. Lanjouw. Micro–level estimation of poverty and inequality.
Econometrica, 71(1):355–364, 2003.
A. Fiszbein, N. Schady, F. Ferreira, M. Grosh, N. Keleher, P. Olinto, and E. Skoufias. Conditional
Cash Transfers: Reducing Present and Future Poverty. The World Bank, 2009.
W. Fithian, J. Taylor, R. Tibshirani, and R. Tibshirani. Selective Sequential Model Selection.
ArXiv e-prints, Dec. 2015.
I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools.
Technometrics, 35(2):109–135, 1993.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In L. Saitta, editor,
Proceedings of the Thirteenth International Conference on Machine Learning (ICML 1996),
pages 148–156. Morgan Kaufmann, 1996. ISBN 1-55860-419-7.
A. Grothendieck. Resume de la theorie metrique des produits tensoriels topologiques. Boletim
Sociedade de Matematica de Sao Paulo, 8:1 – 79, 1953.
M. G. G’Sell, S. Wager, A. Chouldechova, and R. Tibshirani. Sequential selection procedures
and false discovery rate control. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 78(2):423–444, 2016.
O. Guedon and R. Vershynin. Community detection in sparse networks via Grothendieck’s
inequality. Probability Theory and Related Fields, 165, 2016.
R. Hanna and B. A. Olken. Universal basic incomes versus targeted transfers: Anti-poverty
programs in developing countries. Journal of Economic Perspectives, 32(4):201–26, 2018.
T. Hastie, R. Tibshirani, and J. Friedman. Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, New York, NY, 2009.
T. Hastie, R. Tibshirani, and R. J. Tibshirani. Extended Comparisons of Best Subset Selection,
Forward Stepwise Selection, and the Lasso. ArXiv e-prints, July 2017.
J. Huang, J. L. Horowitz, and F. Wei. Variable selection in nonparametric additive models. Ann.
Statist., 38(4):2282–2313, 2010. ISSN 0090-5364.
A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-
dimensional regression. Journal of Machine Learning Research, 15:2869–2909, 2014.
B.-Y. Jing, Q.-M. Shao, and Q. Wang. Self-normalized Cramer-type large deviations for inde-
pendent random variables. Annals of Probability, 31(4):2167–2215, 2003.
D. Kozbur. Sharp Convergence Rates for Forward Regression in High-Dimensional Sparse Linear
Models. ArXiv e-prints, Feb. 2017a.
D. Kozbur. Testing-based forward model selection. American Economic Review, 107(5):266–69,
May 2017b.
H. Leeb and B. M. Potscher. Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory, 24(2):338–376, 2008. ISSN 0266-4666.
A. Li and R. F. Barber. Accumulation tests for fdr control in ordered hypothesis testing. Journal
of the American Statistical Association, 112(518):837–849, 2017.
Y. Luo and M. Spindler. L2 Boosting for Economic Applications. ArXiv e-prints, Feb. 2017.
N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the
lasso. Annals of Statistics, 34:1436–1462, 2006.
W. K. Newey and K. D. West. A simple, positive semi-definite heteroskedasticity and autocor-
relation consistent covariance matrix. Econometrica, 55(3):703–708, 1987.
A. Nichols and L. McBride. Retooling poverty targeting using out-of-sample validation and
machine learning. World Bank Economic Review, 32(3):531–550, 2018.
M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61:1025–1045, 2008.
M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Trans-
actions on Information Theory, 59:3434–3447, 2013.
T. Tao. When is correlation transitive? URL: https://terrytao.wordpress.com/2014/06/05/when-is-correlation-transitive/, June 2014.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society: Series B, 58:267–288, 1996.
R. J. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani. Exact Post-Selection Inference for
Sequential Regression Procedures. ArXiv e-prints, Jan. 2014.
J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions
on Information Theory, 50(10):2231–2242, Oct 2004. ISSN 0018-9448.
S. van de Geer, P. Buhlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence
regions and tests for high-dimensional models. Annals of Statistics, 42(3):1166–1202, 2014.
H. Wang. Forward regression for ultra-high dimensional variable screening. Journal of the
American Statistical Association, 104(488):1512–1524, 2009.
H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica, 48(4):817–838, 1980.
C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional
linear regression. Annals of Statistics, 36(4):1567–1594, 2008.
C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high di-
mensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Method-
ology), 76(1):217–242, 2014.
T. Zhang. On the consistency of feature selection using greedy least squares. Journal of Machine
Learning, 10:555–568, 2009.
T. Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations.
IEEE Transactions on Information Theory, 57(7):4689–4708, 2011.
S. Supplement to “Analysis of Testing-Based Forward Model Selection”
This supplement proves Theorems 2 and 3, supporting lemmas for Theorems 1 and 4, and
Theorem 5.
S.1. Proof of Theorems 2 and 3
Theorem 2 follows by applying Theorem 1 in the following way. If s grows faster than s0, then
there is m < s such that s0 < m < Kn and m/s0 exceeds c′F(Kn) = O(1), giving a contradiction.
The first statement of the theorem follows from applying the bound on s. Theorem 3 follows by
‖θ0 − θ‖1 6√s+ s0‖θ0 − θ‖2 6
√s+ s0ϕmin(s+ s0)(G)−1En[(x′iθ0 − x′iθ)2]1/2.
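Both inequalities in this display can be checked numerically: ‖δ‖1 ≤ √p ‖δ‖2 is Cauchy-Schwarz (with the dimension p playing the role of the support size s + s0 of θ0 − θ), and ‖δ‖2 ≤ ϕmin(G)−1/2 En[(x′iδ)2]1/2 follows from En[(x′iδ)2] = δ′Gδ ≥ ϕmin(G)‖δ‖22. A sketch with a hypothetical random design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.standard_normal((n, p))
G = X.T @ X / n                      # G = E_n[x_i x_i']
phi_min = np.linalg.eigvalsh(G).min()
delta = rng.standard_normal(p)       # stands in for theta_0 - theta

l1, l2 = np.abs(delta).sum(), np.linalg.norm(delta)
pred = np.sqrt(np.mean((X @ delta) ** 2))   # E_n[(x_i' delta)^2]^{1/2}

# l1 <= sqrt(p) * l2  and  l2 <= phi_min^{-1/2} * pred, as in the display.
print(l1 <= np.sqrt(p) * l2, l2 <= phi_min ** -0.5 * pred)
```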
S.2. Proof of Lemmas 3 and 4
S.2.1. Proof of Lemma 3
It was already shown that ℓ(θ) ≤ ℓ(θ0) + s0 t ϕmin(s + s0)(G)−1. Expanding the above two quadratics in ℓ(·) gives

En[(x′iθ0 − x′iθ)2] ≤ |2En[εix′i(θ − θ0)]| + s0 t ϕmin(s + s0)(G)−1 ≤ 2‖En[εixi]‖∞ ‖θ0 − θ‖1 + s0 t ϕmin(s + s0)(G)−1.

To bound ‖θ0 − θ‖1:

‖θ0 − θ‖1 ≤ √(s + s0) ‖θ0 − θ‖2 ≤ √(s + s0) ϕmin(s + s0)(G)−1/2 En[(x′iθ0 − x′iθ)2]1/2.

If En[(x′iθ0 − x′iθ)2]1/2 = 0, then the first conclusion of Theorem 1 holds. Otherwise, combining the above bounds and dividing by En[(x′iθ0 − x′iθ)2]1/2 gives

En[(x′iθ0 − x′iθ)2]1/2 ≤ 2‖En[εixi]‖∞ √(s + s0) ϕmin(s + s0)(G)−1/2 + s0 t ϕmin(s + s0)(G)−1 / En[(x′iθ0 − x′iθ)2]1/2.

Finally, either En[(x′iθ0 − x′iθ)2]1/2 ≤ √(s0 t) ϕmin(s + s0)(G)−1/2, in which case Lemma 3 holds, or alternatively En[(x′iθ0 − x′iθ)2]1/2 > √(s0 t) ϕmin(s + s0)(G)−1/2, in which case

En[(x′iθ0 − x′iθ)2]1/2 ≤ 2‖En[εixi]‖∞ √(s + s0) ϕmin(s + s0)(G)−1/2 + √(s0 t) ϕmin(s + s0)(G)−1/2.
S.2.2. Proof of Lemma 4
For any S define θ∗S to be the minimizer of E(S). For any S define also dS = θ∗S − θ∗S0∪S. Finally, let δ0,S = θ0 − θ∗S0∪S. Note that E(S) − E(S0 ∪ S) = d′S E[G] dS. By arguments in the earlier sections, d′S E[G] dS ≤ s0 ctest ϕmin(Ktest)(E[G])−1. But d′S E[G] dS ≥ ϕmin(Ktest)(E[G]) ‖dS‖22. So

‖dS‖2 ≤ √(s0 ctest) ϕmin(Ktest)(E[G])−1.

In addition, δ0,S is bounded by

‖δ0,S‖2 = ‖E[En[xiS0∪S εi]]‖2 ≤ (|S| + s0)1/2 maxj |E[En[xij εai]]| ≤ (1/2) √((|S| + s0) ctest ϕmin(Ktest)(E[G])−1),

where the last bound comes from Cauchy-Schwarz (passing to E[En[x2ij]]1/2 E[En[(εai)2]]1/2) along with the assumed condition on εai and the fact that c′test ≤ ctest. Next,

θ = GS−1 En[xiS(x′iS θ∗S + εi − x′iS∪S0 dS + x′iS∪S0 δ0,S)] = θ∗S + GS−1 En[xiS εi] + GS−1 En[xiS x′iS∪S0(−dS + δ0,S)]

⇒ ‖θ − θ∗S‖2 ≤ ϕmin(s)(G)−1/2 ‖En[xiS εi]‖2 + ‖GS−1 En[xiS x′iS∪S0(−dS + δ0,S)]‖2
≤ ϕmin(s)(G)−1/2 s1/2 ‖En[xiεi]‖∞ + ϕmin(s)(G)−1/2 ϕmax(s + s0)(G)1/2 (‖dS‖2 + ‖δ0,S‖2).

Finally,

(En[(x′iθ − x′iθ0)2])1/2 ≤ ϕmax(s0 + s)(G)1/2 ‖θ − θ0‖2 ≤ ϕmax(s0 + s)(G)1/2 (‖θ − θ∗S‖2 + ‖δ0,S‖2 + ‖dS‖2)
≤ ϕmax(s0 + s)(G)1/2 ϕmin(s0 + s)(G)−1/2 s1/2 ‖En[xiεi]‖∞ + ϕmax(s0 + s)(G)1/2 ((3/2) + (3/2) ϕmax(s0 + s)(G)1/2 ϕmin(s + s0)(G)−1/2) × √((s + s0) ctest ϕmin(Ktest)(E[G])−1)
≤ ϕmax(s0 + s)(G)1/2 ϕmin(s0 + s)(G)−1/2 s1/2 ‖En[xiεi]‖∞ + 3 ϕmax(s0 + s)(G) ϕmin(s + s0)(G)−1/2 √((s + s0) ctest ϕmin(Ktest)(E[G])−1).
S.3. Proof of Supporting Lemmas for Sparsity Bounds for Theorems 1 and 4.
S.3.1. Additional notation
Additional notation is used for the proof of the lemmas which follow. The inner product from H is hereafter denoted simply with 〈·,·〉H = 〈·,·〉. The symbol ′ is kept for use for transposition of finite-dimensional real matrices and vectors derived from certain elements of H defined below. For a, b ∈ L2(Ω, Rn), a′b is defined pointwise (over Ω) and thus defines a random variable Ω → R, and 〈a, b〉 = E[a′b]. In the case of Theorem 1, a′b = 〈a, b〉.

Let V = [v1, ..., vs0], with the understanding that V and similar quantities are formally defined as linear mappings Rs0 → H. Then y = V θ0 + ε is well defined for both Theorems 1 and 4. Let Mk denote projection in H onto the space orthogonal to span(v1, ..., vk). Then the orthonormalized covariates are given by Mk−1vk/〈vk, Mk−1vk〉1/2 for k = 1, ..., s0, and the orthonormalized disturbance by Ms0ε/〈ε, Ms0ε〉1/2. For more general sets S, let QS be projection onto the space orthogonal to span(xj, j ∈ S). For each selected covariate wj, set Spre-wj to be the set of (both true and false) covariates selected prior to wj.
S.3.2. Proof of Lemma 5
It is needed to calculate C1, C2 such that γ′jθ ≥ θkC1 for j ∈ A1k and θk ≥ θlC2 for l > k. Define

∆jℓH(S) = ∆jℓ(S) in the case of Theorem 1, and ∆jℓH(S) = ∆jE(S) in the case of Theorem 4.

Also recall that tH = t in the case of Theorem 1 and tH = c′test in the case of Theorem 4. Note that c′′test is not defined in the context of Theorem 1; in the case of Theorem 1, during the proof of this lemma, c′′test is taken to be equal to 1.

A simple derivation shows that

−∆jℓH(Spre-wj) = (1/n) 〈y, wj〉(〈wj, wj〉)−1〈wj, y〉 = (1/n)(1/‖wj‖2H)(θ′γj + θεγjε)2.

Note the slight abuse of notation in −∆jℓH(Spre-wj) signifying the change in loss under inclusion of wj rather than xj. Next,

(θ′γj + θεγjε)2 ≤ 2(θ′γj)2 + 2(θεγjε)2.

Since θε = 〈ε, y〉 = 〈ε, Ms0y〉/〈ε, Ms0ε〉1/2 = 〈ε, Ms0ε〉1/2, ‖wj‖2H ≥ 1, and j ∈ A1, it follows that

(1/n)(1/‖wj‖2H)(θεγjε)2 ≤ (1/n)(1/‖wj‖2H) θ2ε ((tH n)1/2/(3〈ε, Ms0ε〉)1/2)2 ≤ tH/3.

This implies

(1/2)(−∆jℓH(Spre-wj)) ≤ (1/n)(1/‖wj‖2H)(θ′γj)2 + tH/3.

By the condition that the false j is selected, it holds that −∆jℓH(Spre-wj) ≥ tH, and so (1/3)(−∆jℓH(Spre-wj)) ≥ tH/3, which implies that −tH/3 ≥ (1/3)∆jℓH(Spre-wj) and

(1/2)(−∆jℓH(Spre-wj)) − tH/3 ≥ (1/6)(−∆jℓH(Spre-wj)).

Finally, this yields that

(1/n)(1/‖wj‖2H)(γ′jθ)2 ≥ (1/6)(−∆jℓH(Spre-wj)).

By the fact that wj was selected ahead of vk, it holds that

−∆jℓH(Spre-wj) ≥ (−∆kℓH(Spre-wj)) c′′test.

Next, to lower bound −∆kℓH(Spre-wj), define a perturbed version of ℓH. Let ξ ∈ H. Let ℓHy+ξ be defined analogously to ℓH except with the role of y in ℓ played by y + ξ in ℓHy+ξ. Choose ξ such that 〈ξ, wj〉 = 0 for j = 1, ..., m, 〈ξ, vk〉 = 0 for k = 1, ..., s0, and 〈ξ, ε〉 = 0. In the case of Theorem 1, ξ ≠ 0 exists provided m + s0 + 1 < n. If not, then H can be enlarged appropriately to allow ξ to exist, for example, with the inclusion ι : H → H ⊕ R, x ↦ (x, 0), ξ = (0, 1). Then due to the orthogonality of ξ to the wj, the vk, and ε, it follows that

−∆kℓH(Spre-wj) = −∆kℓHy+ξ(Spre-wj)

with the right-hand side possibly defined on an enlarged H as described above.

Next, the following reduction holds:

−∆kℓHy+ξ(Spre-wj) ≥ −∆kℓHy+ξ(Spre-wj ∪ {vk+1θk+1 + ... + vs0θs0 + ε + ξ}) = −∆vkℓHy+ξ(Spre-wj ∪ {vk+1θk+1 + ... + vs0θs0 + ε + ξ}).

Let ↔ξMk be projection on the space orthogonal to the span of the vectors listed in Spre-wj ∪ {vk+1θk+1 + ... + vs0θs0 + ε + ξ}. (The accent ↔ is meant to emphasize that covariates selected before and after vk, or not at all, are considered.) Then the above term further reduces as

= (1/n) 〈(y + ξ), ↔ξMkvk〉2/〈vk, ↔ξMkvk〉 = (1/n) 〈θkvk, ↔ξMkvk〉2/〈vk, ↔ξMkvk〉 = (1/n) θ2k 〈vk, ↔ξMkvk〉.

Then seek a lower bound on (1/n)〈vk, ↔ξMkvk〉. Writing v̄k for the orthonormalized vk, note that for some vector ηk it holds that v̄k = 〈vk, Mk−1vk〉−1/2 vk − [v1, ..., vk−1]ηk. Then 〈v̄k, ↔ξMkv̄k〉 = 〈vk, Mk−1vk〉−1 〈vk, ↔ξMkvk〉. Let H = [V W]. Let yk = vk+1θk+1 + ... + vs0θs0 + ε. A lower bound on the term 〈vk, ↔ξMkvk〉 follows from a lower bound on the eigenvalues of the matrix below, for any c > 0:

〈vk, ↔ξMkvk〉 ≥ λmin(〈[H (yk + ξ)c], [H (yk + ξ)c]〉).

That is, it is enough to bound the spectrum of nGc,ξ defined by

Gc,ξ = (1/n) [ 〈H, H〉        c〈yk + ξ, H〉
               〈H, yk + ξ〉c   c2〈yk + ξ, yk + ξ〉 ].

Using the fact that ξ is orthogonal to H and ε, Gc,ξ reduces to

Gc,ξ = (1/n) [ 〈H, H〉    c〈yk, H〉
               〈H, yk〉c   c2〈yk, yk〉 + c2〈ξ, ξ〉 ].

As a result of the above reductions, for each c, ξ,

−∆kℓH(Spre-wj) ≥ (1/n) 〈vk, Mk−1vk〉−1 n λmin(Gc,ξ) θ2k.

And therefore,

−∆kℓH(Spre-wj) ≥ (1/n) 〈vk, Mk−1vk〉−1 n θ2k lim λmin(Gc,ξ),

where the limit is taken as c → 0 with (1/n)〈ξ, ξ〉 = c−2. By continuity of eigenvalues for symmetric matrices, passing to the limit gives

−∆kℓH(Spre-wj) ≥ (1/n) 〈vk, Mk−1vk〉−1 n θ2k λmin( (1/n) [ 〈H, H〉  0
                                                            0       1 ] )
≥ 〈vk, Mk−1vk/n〉−1 (1/n) θ2k ϕmin(m + s0)(GH) ≥ (1/n) · 1 · θ2k ϕmin(m + s0)(GH).

This gives

(1/n)(1/‖wj‖2H)(γ′jθ)2 ≥ c′′test (1/6)(1/n) ϕmin(m + s0)(GH) θ2k.

Using the fact that ‖wj‖H ≥ 1 implies that

(γ′jθ)2 ≥ θ2k c′′test (1/6) ϕmin(m + s0)(GH).

Now suppose no true variables remain when j is selected. Then 〈wj, wj〉 = 〈uj, uj〉 = 1 and

−∆jℓH(Spre-wj) = (1/n) γ2jε θ2ε ≥ tH.

Note that θε is given by θε = 〈ε, Ms0ε〉1/2. Therefore, γ2jε ≥ tH n/〈ε, Ms0ε〉. This implies that j ∈ A2. Therefore, set

(C1)2 = c′′test (1/6) ϕmin(m + s0)(GH).

Next, construct C2. For each selected true covariate vk, set Spre-vk to be the set of (both true and false) covariates selected prior to vk. Note that

(1/n) θ2k = −∆kℓH({v1, ..., vk−1}) ≥ −∆kℓH(Spre-vk)

since {v1, ..., vk−1} ⊆ Spre-vk. In addition, if vk is selected before vl (or vl is not selected), then

−∆kℓH(Spre-vk) ≥ c′′test(−∆lℓH(Spre-vk)) ≥ c′′test ϕmin(m + s0)(GH) (1/n) θ2l.

Therefore, taking

(C2)2 = c′′test ϕmin(m + s0)(GH)

implies that θk ≥ θlC2 for any l > k.

As a final remark, consider the case that θk = 0. Then θl = 0 for l > k. Then if j ∈ A1k, it follows that γ′jθ = 0. Therefore, using reasoning as above, −∆jℓH(Spre-wj) = (1/n)(1/‖wj‖2H)(θεγjε)2 ≤ tH/3. But this is impossible, because being selected into the model requires −∆jℓH(Spre-wj) ≥ tH. Therefore, A1k is empty if θk = 0.
S.3.3. Proof of Lemma 6
The desired element Z of Gs0 is constructed as the covariance matrix of certain real, mean-zero random vectors

X = (Xk)k=1,...,s0,  Y = (Yl)l=1,...,s0.

The random variables Xk, Yl constituting X, Y are defined as follows. Let βk = θk/θk−1 for k = 2, ..., s0. Then note that the components of B can be expressed as Bkl = βk+1 · ... · βl for k < l, extended symmetrically for components l < k.

Decompose the elements of the sequence βk into

βk = βak βbk

in such a way that for all l ≥ k ≥ 2,

C2 ≤ βak · ... · βal ≤ (C2)−1,

and for all k ≥ 2,

0 ≤ βbk ≤ 1.

Induction establishes the existence of such a decomposition with the additional property that βak > βk only if there is q ≤ k such that βaq · ... · βak = C2. The case s0 = 2 follows by taking βa2 = max{C2, β2} and noting that β2 = θ2/θ1 ≤ (C2)−1. Assume the complete induction hypothesis that the decomposition exists for sequences with s0 = 2, ..., s for some s. Consider a sequence β2, ..., βs+1. Apply the decomposition to obtain βk = βakβbk for k ≤ s. The existence of the decomposition fails at k = s + 1 only if βs+1 > 1 and there is an index j such that βaj · ... · βas · βs+1 > (C2)−1. Then there must be an index o ≥ j such that βao > βo, as otherwise this contradicts θs+1/θj−1 ≤ (C2)−1. If there are multiple such indices o, then consider the largest one. There must then also be an index q such that βaq · ... · βao = C2. There are two cases to consider: q < j and q ≥ j. Consider the first case, q < j. In this case, the above conclusions can be visualized by

(βaq · ... · βaj−1) · (βaj · ... · βao) · (βo+1 · ... · βs+1) > (C2)−1,  with βaj · ... · βao = C2,  βo+1 · ... · βs+1 ≤ (C2)−1,  and (βaj · ... · βao)(βo+1 · ... · βs+1) ≤ 1.

These imply that βaq · ... · βaj−1 < C2, which contradicts the inductive hypothesis. The case q ≥ j is similar. This completes the inductive argument and therefore establishes the decomposition βk = βakβbk, k = 2, ..., s0, for all s0.

Using the fact that βbk ≤ 1 for all k allows the definition of the following autoregressive process. Let U1 ∼ N(0, 1) and let W1 = U1. Define Uk ∼ N(0, 1) independently of previous random variables. Define Wk inductively as

Wk = βbk · Wk−1 + (1 − (βbk)2)1/2 · Uk.

Note that E[W2k] = 1 and E[WkWl] = βbk+1 · ... · βbl if k < l. Then set Xk, Yl as follows:

Xk = C2 Wk (βa2 · ... · βak)−1/2 (βak+1 · ... · βas0)1/2,
Yl = C2 Wl (βal+1 · ... · βas0)−1/2 (βa2 · ... · βal)1/2.

By construction,

E[XkYl] = (C2)2 Bkl, for k ≤ l.

Next, note that E[X2k] ≤ 1 and E[Y2l] ≤ 1. This then implies (taking H1 to be the span of U1, ..., Us0 within the set of square-integrable random variables) that both

E[XY′] ∈ Gs0 and E[XY′]′ ∈ Gs0.

Take Z = E[XY′]′. Let C3 = (C2)−2. Note that Γ is upper triangular due to the way the γj are defined. Because Γ is upper triangular, only the lower-triangular components of E[XY′]′ matter for computing the product ΓC3Z. Using this fact and the above calculations gives the desired factorization

ΓB = ΓC3Z = ΓC3E[XY′]′.
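The covariance identity E[WkWl] = βbk+1 · ... · βbl underlying this construction can be verified directly from the recursion, without simulation; the βb values below are arbitrary stand-ins in (0, 1]:

```python
import numpy as np

rng = np.random.default_rng(4)
s0 = 6
beta_b = rng.uniform(0.1, 1.0, size=s0 + 1)   # beta^b_2, ..., beta^b_{s0}; entries 0, 1 unused

# Covariance of the process W_1 = U_1,
#   W_k = beta_b[k] * W_{k-1} + sqrt(1 - beta_b[k]**2) * U_k,  U_k iid N(0, 1),
# computed exactly from the recursion: E[W_k^2] = 1 and, for l < k,
# E[W_k W_l] = beta_b[k] * E[W_{k-1} W_l].
C = np.eye(s0 + 1)
for k in range(2, s0 + 1):
    for l in range(1, k):
        C[k, l] = C[l, k] = beta_b[k] * C[k - 1, l]

print(np.round(C[1:, 1:], 4))
```

The matrix C reproduces the claimed product formula entrywise, which is what makes E[XkYl] collapse to (C2)2 Bkl after the rescaling by the βa products.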
S.3.4. Proof of Lemma 8
Collect the $m_1$ false selections into $W = [w_{j_1}, \ldots, w_{j_{m_1}}]$. Set $R = [r_{j_1}, \ldots, r_{j_{m_1}}]$, $U = [u_{j_1}, \ldots, u_{j_{m_1}}]$. Decompose $W = R + U$. Then $\langle W, W\rangle = \langle R, R\rangle + \langle U, U\rangle$. Here, the objects $\langle W, W\rangle$, $\langle R, R\rangle$, and $\langle U, U\rangle$ are formally defined as $m_1\times m_1$ real matrices with $(k,l)$ entry given by $\langle w_k, w_l\rangle$, $\langle r_k, r_l\rangle$, $\langle u_k, u_l\rangle$, respectively (which, note, are genuine inner products on $\mathcal H$).

Next, by the above normalization, $\mathrm{diag}(\langle U, U\rangle) = I$ if $\langle u_j, u_j\rangle = 1$ for all $j \in A_1$. Recall that this normalization is possible provided $\varphi_{\min}(m+s_0)(G_{\mathcal H}) > 0$. Since $\mathrm{diag}(\langle U, U\rangle) = I$, it follows that the average inner product between the $u_j$, given by
\[
\rho = \frac{1}{m_1(m_1-1)}\sum_{j\neq l\in A_1}\langle u_j, u_l\rangle,
\]
must be bounded below by
\[
\rho \geq -\frac{1}{m_1-1}
\]
due to the positive semidefiniteness of $\langle U, U\rangle$. (This can be checked as a direct consequence of the fact that $1_{m_1\times 1}'\langle U, U\rangle 1_{m_1\times 1} \geq 0$.) This implies an upper bound on the average off-diagonal term in $\langle R, R\rangle$, since $\langle W, W\rangle$ is a diagonal matrix. Since the $v_k$ are orthonormal, the sum of all the elements of $\langle R, R\rangle$ is given by $\|\sum_{j\in A_1}\gamma_j\|_2^2$. Since $\|\sum_{j\in A_1}\gamma_j\|_2^2 = \sum_{j\in A_1}\|\gamma_j\|_2^2 + \sum_{j\neq l\in A_1}\gamma_j'\gamma_l$ and since $\langle W, W\rangle$ is a diagonal matrix, it must be the case that
\[
\frac{1}{m_1(m_1-1)}\sum_{j\neq l\in A_1}\gamma_j'\gamma_l = -\rho.
\]
Therefore,
\[
-\rho = \frac{1}{m_1(m_1-1)}\Big(\Big\|\sum_{j\in A_1}\gamma_j\Big\|_2^2 - \sum_{j\in A_1}\|\gamma_j\|_2^2\Big) \leq \frac{1}{m_1-1}.
\]
This implies that
\[
\Big\|\sum_{j\in A_1}\gamma_j\Big\|_2^2 \leq m_1 + \sum_{j\in A_1}\|\gamma_j\|_2^2.
\]
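The lower bound $\rho \geq -1/(m_1-1)$ is a generic fact about Gram matrices of unit vectors, and can be illustrated numerically (the dimensions below are arbitrary and unrelated to the objects of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
m1, dim = 6, 4
U = rng.standard_normal((dim, m1))
U /= np.linalg.norm(U, axis=0)           # columns are unit vectors, so diag(<U,U>) = I
G = U.T @ U                              # Gram matrix <U,U>, positive semidefinite
rho = (G.sum() - m1) / (m1 * (m1 - 1))   # average off-diagonal inner product
# 1'G1 >= 0 gives m1 + m1*(m1-1)*rho >= 0, i.e. rho >= -1/(m1-1)
assert rho >= -1.0 / (m1 - 1) - 1e-12
```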
Next, bound $\max_{j\in A_1}\|\gamma_j\|_2^2$.
Note $\|\gamma_j\|_2^2 = \|r_j\|_{\mathcal H}^2$ since $V$ is orthonormal. Note that $\|w_j\|_{\mathcal H}^2$ is upper bounded by $\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}$. To see this, note that $\|w_j\|_{\mathcal H}^2 = \|c_jQ_{\text{pre-}j}x_j\|_{\mathcal H}^2 \leq c_j^2\|x_j\|_{\mathcal H}^2 = c_j^2n$, where $c_j$ is the normalizing constant such that $w_j = c_jQ_{\text{pre-}j}x_j$. At the same time, $c_j^2$ satisfies $\|M_{s_0}Q_{\text{pre-}j}x_j\|_{\mathcal H}^2 = c_j^{-2}$ whenever $w_j \notin \mathrm{span}(V)$. Note also that $\|M_{s_0}Q_{\text{pre-}j}x_j\|_{\mathcal H}^2 \geq \|Q_{S_0\cup\text{pre-}j}x_j\|_{\mathcal H}^2$, where the notation $Q_{S_0\cup\text{pre-}j}$ denotes projection onto the space orthogonal to the covariates indexed in $S_0$ or selected before $w_j$. To see this, consider an arbitrary Hilbert space $\mathcal H$, closed subspaces $1$, $2$, $12 = \mathrm{span}(1\cup 2)$, projections $P_1, P_2, P_{12}$ onto them, projections $Q_1, Q_2, Q_{12}$ onto the respective orthogonal complements, and any vector $w$. Then $w = Q_{12}w + P_{12}w$, and $Q_2Q_1w = Q_2Q_1Q_{12}w + Q_2Q_1P_{12}w = Q_{12}w + Q_2Q_1P_{12}w$. Note that the inner product between the above two terms vanishes: $\langle Q_{12}w, Q_2Q_1P_{12}w\rangle_{\mathcal H} = \langle w, Q_{12}P_{12}w\rangle_{\mathcal H} = \langle w, 0\rangle_{\mathcal H} = 0$. Then by the Pythagorean theorem, $\|Q_2Q_1w\|_{\mathcal H}^2 = \|Q_{12}w\|_{\mathcal H}^2 + \|Q_2Q_1P_{12}w\|_{\mathcal H}^2 \geq \|Q_{12}w\|_{\mathcal H}^2$, so $\|Q_{12}w\|_{\mathcal H} \leq \|Q_2Q_1w\|_{\mathcal H}$. Therefore, the quantity $\|Q_{S_0\cup\text{pre-}j}x_j\|_{\mathcal H}^2$ is lower bounded by $n\varphi_{\min}(m+s_0)(G_{\mathcal H})$. As a result, $c_j^2 \leq n^{-1}\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}$, giving the desired bound on $\|w_j\|_{\mathcal H}^2$. Therefore, $\|r_j\|_{\mathcal H}^2 = \|w_j\|_{\mathcal H}^2 - 1 \leq \varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1} - 1$. It follows that
\[
\max_{j\in A_1}\|\gamma_j\|_2^2 \leq \varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1} - 1.
\]
This then implies that
\[
\Big\|\sum_{j\in A_1}\gamma_j\Big\|_2^2 \leq m_1\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}.
\]
The same argument as above also shows that for any choice of signs $e_j \in \{-1,1\}$, it holds that
\[
\Big\|\sum_{j\in A_1}e_j\gamma_j\Big\|_2^2 \leq m_1\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}.
\]
(In more detail, take $W_e = [w_{j_1}e_{j_1}, \ldots, w_{j_{m_1}}e_{j_{m_1}}]$, etc., and rerun the same argument.)
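The projection inequality $\|Q_{12}w\| \leq \|Q_2Q_1w\|$ derived above is easy to confirm numerically with random subspaces (the dimensions below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
A = rng.standard_normal((d, 3))   # basis of subspace 1
B = rng.standard_normal((d, 3))   # basis of subspace 2
w = rng.standard_normal(d)

def Q(M):
    # projection onto the orthogonal complement of the column span of M
    return np.eye(M.shape[0]) - M @ np.linalg.pinv(M)

Q1, Q2 = Q(A), Q(B)
Q12 = Q(np.hstack([A, B]))        # complement of span(1 U 2)
# Pythagorean argument above: ||Q12 w|| <= ||Q2 Q1 w||
assert np.linalg.norm(Q12 @ w) <= np.linalg.norm(Q2 @ Q1 @ w) + 1e-10
```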
S.3.5. Proof of Lemma 10
In this proof, the number of elements of $A_2$ is bounded. Recall that the criterion for $j \in A_2$ is that $|\gamma_{j\varepsilon}| \geq \frac{t_{\mathcal H}^{1/2}n^{1/2}}{(3\langle\varepsilon, M_{s_0}\varepsilon\rangle)^{1/2}}$. Note also that $\gamma_{j\varepsilon}$ is given by
\[
\gamma_{j\varepsilon} = \langle\varepsilon, w_j\rangle = \Big\langle\varepsilon, \frac{1}{\langle\varepsilon, M_{s_0}\varepsilon\rangle^{1/2}}M_{s_0}w_j\Big\rangle.
\]
Next, let $H = [v_1, \ldots, v_{s_0}, w_1, \ldots, w_m]$. Note that
\[
\frac{1}{\langle\varepsilon, M_{s_0}\varepsilon\rangle^{1/2}}M_{s_0}w_j \in \mathrm{span}(H).
\]
Therefore,
\[
\gamma_{j\varepsilon} = \langle\varepsilon, H\rangle\langle H, H\rangle^{-1}\Big\langle H, \frac{1}{(\langle\varepsilon, M_{s_0}\varepsilon\rangle)^{1/2}}M_{s_0}w_j\Big\rangle.
\]
Let $\mu_j$ be the sign: $+1$ for each $j\in A_2$ such that $\gamma_{j\varepsilon} > 0$ and $-1$ for each $j\in A_2$ such that $\gamma_{j\varepsilon} < 0$. By the fact that $j\in A_2$, $\gamma_{j\varepsilon}\mu_j \geq \frac{t_{\mathcal H}^{1/2}n^{1/2}}{(3\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}$, and summing over $j\in A_2$ gives
\[
\sum_{j\in A_2}\langle\varepsilon, H\rangle\langle H,H\rangle^{-1}\Big\langle H, \frac{1}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}M_{s_0}w_j\mu_j\Big\rangle \geq m_2\frac{t_{\mathcal H}^{1/2}n^{1/2}}{(3\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}.
\]
This implies that
\[
\Big\|\langle H,H\rangle^{-1}\Big\langle H, \frac{1}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}\sum_{j\in A_2}M_{s_0}w_j\mu_j\Big\rangle\Big\|_1\|\langle\varepsilon,H\rangle\|_\infty \geq m_2\frac{t_{\mathcal H}^{1/2}n^{1/2}}{(3\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}},
\]
which further implies that
\[
\sqrt{m+s_0}\,\Big\|\langle H,H\rangle^{-1}\Big\langle H, \frac{1}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}\sum_{j\in A_2}M_{s_0}w_j\mu_j\Big\rangle\Big\|_2\|\langle\varepsilon,H\rangle\|_\infty \geq m_2\frac{t_{\mathcal H}^{1/2}n^{1/2}}{(3\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}.
\]
Next, further upper bound the $\|\cdot\|_2$ term on the left side above by
\[
\Big\|\langle H,H\rangle^{-1}\Big\langle H, \frac{1}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}\sum_{j\in A_2}M_{s_0}w_j\mu_j\Big\rangle\Big\|_2 \leq \frac{n^{-1/2}}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}\varphi_{\min}(s_0+m)(G_{\mathcal H})^{-1/2}\Big\|M_{s_0}\sum_{j\in A_2}w_j\mu_j\Big\|_{\mathcal H}.
\]
Next, by the fact that $M_{s_0}$ is a projection (hence non-expansive) and the $w_j$ are mutually orthogonal, this is
\[
\leq \frac{n^{-1/2}}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}\varphi_{\min}(s_0+m)(G_{\mathcal H})^{-1/2}\sqrt{\sum_{j\in A_2}\|w_j\mu_j\|_{\mathcal H}^2}.
\]
Earlier, it was shown that $\max_j\|w_j\|_{\mathcal H}^2 \leq \varphi_{\min}(s_0+m)(G_{\mathcal H})^{-1}$. Therefore, putting the above inequalities together,
\[
\frac{n^{-1/2}}{(\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}\sqrt{m+s_0}\,\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}\sqrt{m_2}\,\|\langle\varepsilon,H\rangle\|_\infty \geq m_2\frac{t_{\mathcal H}^{1/2}n^{1/2}}{(3\langle\varepsilon,M_{s_0}\varepsilon\rangle)^{1/2}}.
\]
This implies that
\[
m_2 < \frac{1}{n^2}\frac{3}{t_{\mathcal H}}\langle\varepsilon,M_{s_0}\varepsilon\rangle(m+s_0)\frac{\|\langle\varepsilon,H\rangle\|_\infty^2}{\varepsilon'M_{s_0}\varepsilon}\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-2}.
\]
In the case of Theorem 1, this is further bounded by
\[
\leq 3(m+s_0)\frac{\|E_n[x_i\varepsilon_i]\|_\infty^2}{t}\varphi_{\min}(m+s_0)(G)^{-2}.
\]
Under the assumed condition that $t^{1/2} \geq 2\|E_n[x_i\varepsilon_i]\|_\infty\varphi_{\min}(m+s_0)(G)^{-1}$, it follows that
\[
m_2 \leq \frac{3}{4}(m+s_0).
\]
Similarly, the condition of Theorem 4 that $E[E_n[\varepsilon_i^{a2}]] \leq \frac{1}{2}\varphi_{\min}(E[G])^{-1}c'_{\text{test}}$ yields $m_2 \leq \frac{3}{4}(m+s_0)$ in the same way. Finally, substituting $m = m_1 + m_2$ gives
\[
m_2 \leq 3m_1 + 3s_0.
\]
S.3.6. Proof of Lemma 11
Combining $m_1 \leq \varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}C_1^{-2}C_3^2(K^R_G)^2s_0$ and $m_2 \leq 3(m_1+s_0)$ gives
\[
m \leq \big[4\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1}C_1^{-2}C_3^2(K^R_G)^2 + 3\big]s_0.
\]
In addition, in the case of Theorem 1, $C_1^2 = \frac{1}{6}\varphi_{\min}(m+s_0)(G_{\mathcal H})$, $C_2^2 = \varphi_{\min}(m+s_0)(G_{\mathcal H})$, $C_3^2 = (C_2^{-2})^2 = \varphi_{\min}(m+s_0)(G_{\mathcal H})^{-2}$, $C_1^{-2}C_3^2 = 6\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-3}$, and $K^R_G < 1.783$. Therefore, $m \leq (3 + 24\times 1.783^2\times\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-4})s_0$. Because $\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-1} \geq 1$ and $3 + 24\times 1.783^2 = 79.2981\ldots < 80$, it holds that
\[
m \leq 80\times\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-4}s_0.
\]
This bound holds for each positive integer $m$ of wrong selections, provided $t^{1/2} \geq 2\varphi_{\min}(m+s_0)(G)^{-1}\|E_n[x_i\varepsilon_i]\|_\infty$. This concludes the proof of the sparsity bound for Theorem 1. Using similar reasoning in the case of Theorem 4, on the event $\mathcal T$, it follows that $m \leq 80\times\varphi_{\min}(m+s_0)(G_{\mathcal H})^{-4}c''^{-3}_{\text{test}}s_0$ provided $E[E_n[\varepsilon_i^{a2}]] \leq \frac{1}{2}\varphi_{\min}(m+s_0)(E[G])^{-1}c'_{\text{test}}$. Setting $m = K_{\text{test}} - s_0$ would contradict Condition 2, since then $K_{\text{test}} \leq 80\times\varphi_{\min}(K_{\text{test}})(E[G])^{-4}c''^{-3}_{\text{test}}s_0 + s_0 < K_{\text{test}}$. Therefore, $m < K_{\text{test}} - s_0$ and thus
\[
s \leq (80\times\varphi_{\min}(K_{\text{test}})(G_{\mathcal H})^{-4}c''^{-3}_{\text{test}} + 1)s_0,
\]
completing the proof of the sparsity bound for Theorem 4.
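The numerical constant entering the sparsity bound can be double-checked directly; a plain arithmetic verification, where the inputs to the illustrative bound function are hypothetical values rather than quantities from the paper:

```python
# Arithmetic check of the constants in the sparsity bound: K_G^R < 1.783 gives
# 3 + 24 * 1.783^2 < 80, so m <= 80 * phi_min^{-4} * s0.
const = 3 + 24 * 1.783**2
assert 79.29 < const < 80

def sparsity_bound(phi_min, s0):
    # phi_min and s0 are illustrative inputs, not quantities from the paper
    return 80 * phi_min**-4 * s0

assert sparsity_bound(1.0, 5) == 400   # with phi_min = 1, at most 80 * s0 wrong selections
```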
S.4. Proof of Theorem 5
The strategy is to apply Theorem 4 using the conditional distribution $P_x$ for $\mathcal D_n$, conditional on $x$. The unconditional result is then shown to follow. Let $E_x(S) = E[\ell(S)|x]$. In addition, for $j \notin S$ let $\theta^{*|x}_{jS} = (x_{jS}'x_{jS})^{-1}E[x_{jS}'(x\theta_0 + \varepsilon^a)|x]$ so that $[\theta^{*|x}_{jS}]_j = (x_j'Q_Sx_j)^{-1}E[x_j'Q_S(x\theta_0 + \varepsilon^a)|x]$. Throughout the proof of Theorem 5, use the abuse of notation $\hat V_{jS} = [\hat V_{jS}]_{jj}$. Let
\[
Z_{jS} = \hat V_{jS}^{-1/2}([\hat\theta_{jS}]_j - [\theta^{*|x}_{jS}]_j).
\]
Let $t_\alpha = \Phi^{-1}(1-\alpha/p)$. Let $\mathcal A$ be the event given by
\[
\mathcal A = \Big\{|Z_{jS}| \leq \Big(\frac{1+c_\tau}{2}\Big)\tau_{jS}t_\alpha \ \text{ for all } j,\ |S| < K_n\Big\}.
\]
Note that $-\Delta_jE_x(S) = [\theta^{*|x}_{jS}]_j^2A_{jS}$ for $A_{jS}$ defined by $A_{jS} = [G_{jS}^{-1}]_{jj}$.
The next lemma states size, power, and continuity properties of the tests of Definition 1.
Lemma 12. The following implications are valid on $\mathcal A$ for all $j$, $|S| < K_n$:
1. $T_{jS\alpha} = 1$ if $-\Delta_jE_x(S) \geq A_{jS}\hat V_{jS}(2c_\tau)^2\tau_{jS}^2t_\alpha^2$.
2. $-\Delta_jE_x(S) \geq A_{jS}\hat V_{jS}\big(\frac{1-c_\tau}{2}\big)^2\tau_{jS}^2t_\alpha^2$ if $T_{jS\alpha} = 1$.
3. $-\Delta_kE_x(S) \leq \frac{\hat V_{kS}A_{kS}}{\hat V_{jS}A_{jS}}\Big(1 + \frac{1+c_\tau}{c_\tau-1}\Big(1+\frac{\tau_{kS}}{\tau_{jS}}\Big)\Big)^2(-\Delta_jE_x(S))$ if $T_{jS\alpha} = 1$, $W_{jS} \geq W_{kS}$.
Next define a sequence of sets $\mathcal X = \mathcal X_n$ which will be shown to have the property that both $P(x \in \mathcal X) \to 1$ and
\[
P_{\mathcal X}(\mathcal A) = \operatorname*{ess\,inf}_{x\in\mathcal X}P(\mathcal A|x) \to 1.
\]
In addition, there will be constants $\bar c_{\text{test}}, \bar c'_{\text{test}}, c''_{\text{test}} > 0$, which are independent of $n$ and the realization of $x$, such that for $c_{\text{test}} = \frac{1}{n}\bar c_{\text{test}}$, $c'_{\text{test}} = \frac{1}{n}\bar c'_{\text{test}}$, and for the set $\mathcal B$ defined by
\[
\mathcal B = \left\{
\begin{array}{ll}
1.\ A_{jS}\hat V_{jS}(2c_\tau)^2\tau_{jS}^2t_\alpha^2 \leq c_{\text{test}} & \\
2.\ A_{jS}\hat V_{jS}\big(\tfrac{1-c_\tau}{2}\big)^2\tau_{jS}^2t_\alpha^2 \geq c'_{\text{test}} & \text{for all } j,\ |S| < K_n \\
3.\ \frac{A_{kS}\hat V_{kS}}{A_{jS}\hat V_{jS}}\Big(1+\frac{1+c_\tau}{1-c_\tau}\Big(1+\frac{\tau_{kS}}{\tau_{jS}}\Big)\Big)^2 \geq c''_{\text{test}} &
\end{array}
\right\},
\]
it holds that $P_{\mathcal X}(\mathcal B) \to 1$.

Define the sets $\mathcal X = \mathcal X_n$ as follows. Set $\mathcal X = \mathcal X_1\cap\mathcal X_2\cap\mathcal X_3\cap\mathcal X_4$ with
\begin{align*}
\mathcal X_1 &= \{x : \max_{j\leq p}E_n[x_{ij}^{12}] = O(1)\},\\
\mathcal X_2 &= \{x : \varphi_{\min}(K_n)(G)^{-1} = O(1)\},\\
\mathcal X_3 &= \{x : \max_{j,|S|<K_n}\|\eta_{jS}\|_1 = O(1)\},\\
\mathcal X_4 &= \{x : P(\varphi_{\min}(K_n)(E_n[\varepsilon_i^2x_ix_i'])^{-1} = O(1)\,|\,x) = 1 - o(1)\}.
\end{align*}
Note that $P(\mathcal X_1), P(\mathcal X_2), P(\mathcal X_3) \to 1$ by assumption. In addition, failure of $P(\mathcal X_4) \to 1$ would contradict the unconditional statement in Condition 4 that $P(\varphi_{\min}(K_n)(E_n[\varepsilon_i^2x_ix_i'])^{-1} = O(1)) = 1 - o(1)$. Therefore, $P(\mathcal X) \to 1$.

The next two sections prove the following two lemmas.

Lemma 13. $P_{\mathcal X}(\mathcal A) \to 1$.

Lemma 14. $P_{\mathcal X}(\mathcal B) \to 1$ for some $\bar c_{\text{test}}, \bar c'_{\text{test}}, c''_{\text{test}}$ as described in the definition of $\mathcal B$ above.

The previous results show that for each $n$, Theorem 4 can be applied conditionally on $x$ with $c_{\text{test}}, c'_{\text{test}}, c''_{\text{test}}$ defined above, with $K_{\text{test}} = K_n - 1$, and with $1 - \alpha - \delta_{\text{test}} = P_{\mathcal X}(\mathcal A\cap\mathcal B)$. Note that renormalizing the covariates to satisfy $E_n[x_{ij}^2] = 1$ does not affect $E_x(S)$ and therefore does not affect the conclusions above. Moreover, on $\mathcal X$, renormalizing does not affect boundedness of the sparse eigenvalues of $G$. The unconditional result is shown as follows. By Theorem 4,
\[
P_{\mathcal X}\Big(E_n[(x_i'\theta^{*|x}_{S_0} - x_i'\hat\theta)^2]^{1/2} \leq O(\sqrt{s_0\log p/n})\Big) \to 1.
\]
Note that $\theta^{*|x}_{S_0} - \theta_0 = (x_{S_0}'x_{S_0})^{-1}x_{S_0}'E[\varepsilon^a|x]$. As a result,
\[
\|\theta_0 - \theta^{*|x}_{S_0}\|_2 \leq \varphi_{\min}(s_0)(G)^{-1/2}\|E_n[x_{iS_0}E[\varepsilon^a_i|x]]\|_2 \leq \varphi_{\min}(s_0)(G)^{-1/2}\sqrt{s_0}\|E_n[x_{ij}E[\varepsilon^a_i|x]]\|_\infty.
\]
By the assumed rate conditions, the sparse eigenvalue conditions, and $\max_iE[\varepsilon^a_i] = O(n^{-1/2})$, the bound on $\|\theta_0 - \theta^{*|x}_{S_0}\|_2$ implies further that
\[
P_{\mathcal X}\Big(E_n[(x_i'\theta^{*|x}_{S_0} - x_i'\theta_0)^2]^{1/2} \leq O(\sqrt{s_0\log p/n})\Big) \to 1.
\]
Theorem 5 follows by the triangle inequality.
S.5. Proof of Supporting Lemmas for Theorem 5
S.5.1. Proof of Lemma 12
For this proof, work on $\mathcal A$ and suppose $|S| < K_n$. To prove the first statement, suppose that $-\Delta_jE_x(S) \geq A_{jS}\hat V_{jS}(2c_\tau)^2\tau_{jS}^2t_\alpha^2$. Then
\begin{align*}
[\theta^{*|x}_{jS}]_j^2A_{jS} &\geq A_{jS}\hat V_{jS}(2c_\tau)^2\tau_{jS}^2t_\alpha^2\\
|[\theta^{*|x}_{jS}]_j| &\geq \hat V_{jS}^{1/2}(2c_\tau)\tau_{jS}t_\alpha\\
|[\hat\theta_{jS}]_j| &\geq \hat V_{jS}^{1/2}(2c_\tau)\tau_{jS}t_\alpha - |[\theta^{*|x}_{jS}]_j - [\hat\theta_{jS}]_j|\\
|[\hat\theta_{jS}]_j| &\geq \hat V_{jS}^{1/2}(2c_\tau)\tau_{jS}t_\alpha - \hat V_{jS}^{1/2}\Big(\frac{1+c_\tau}{2}\Big)\tau_{jS}t_\alpha\\
|[\hat\theta_{jS}]_j| &\geq \hat V_{jS}^{1/2}c_\tau\tau_{jS}t_\alpha,
\end{align*}
which implies $T_{jS\alpha} = 1$.
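The final step in the chain reduces to the elementary inequality $2c_\tau - \frac{1+c_\tau}{2} \geq c_\tau$ whenever $c_\tau \geq 1$; a minimal numeric check over an arbitrary grid of values:

```python
import numpy as np

# The triangle-inequality step relies on 2*c - (1 + c)/2 >= c for every c >= 1,
# equivalently (3c - 1)/2 >= c. Check over a grid of illustrative values.
for c in np.linspace(1.0, 10.0, 1001):
    assert 2 * c - (1 + c) / 2 >= c - 1e-12
```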
Next, prove the second statement. By construction, if $T_{jS\alpha} = 1$ then $|\hat V_{jS}^{-1/2}[\hat\theta_{jS}]_j| \geq c_\tau\tau_{jS}t_\alpha$, which is equivalent to
\[
|[\hat\theta_{jS}]_j| \geq c_\tau\tau_{jS}t_\alpha\hat V_{jS}^{1/2}.
\]
Note that $|[\hat\theta_{jS}]_j - [\theta^{*|x}_{jS}]_j| \leq \hat V_{jS}^{1/2}\big(\frac{1+c_\tau}{2}\big)\tau_{jS}t_\alpha$. Then $T_{jS\alpha} = 1$ implies
\[
|[\theta^{*|x}_{jS}]_j| \geq c_\tau\tau_{jS}t_\alpha\hat V_{jS}^{1/2} - \hat V_{jS}^{1/2}\Big(\frac{1+c_\tau}{2}\Big)\tau_{jS}t_\alpha = \hat V_{jS}^{1/2}\tau_{jS}t_\alpha\Big(\frac{c_\tau-1}{2}\Big).
\]
Therefore $-\Delta_jE_x(S) \geq A_{jS}\hat V_{jS}\tau_{jS}^2t_\alpha^2\big(\frac{c_\tau-1}{2}\big)^2$.
Finally, prove the third statement. Note that $W_{kS} \leq W_{jS}$ implies $\hat V_{kS}^{-1/2}|[\hat\theta_{kS}]_k| \leq \hat V_{jS}^{-1/2}|[\hat\theta_{jS}]_j|$. Then
\begin{align*}
\hat V_{kS}^{-1/2}|[\theta^{*|x}_{kS}]_k| - \Big(\frac{1+c_\tau}{2}\Big)\tau_{kS}t_\alpha &\leq \hat V_{jS}^{-1/2}|[\theta^{*|x}_{jS}]_j| + \Big(\frac{1+c_\tau}{2}\Big)\tau_{jS}t_\alpha\\
\Rightarrow\ \hat V_{kS}^{-1/2}|[\theta^{*|x}_{kS}]_k| &\leq \hat V_{jS}^{-1/2}|[\theta^{*|x}_{jS}]_j| + \Big(\frac{1+c_\tau}{2}\Big)(\tau_{kS}+\tau_{jS})t_\alpha\\
\Rightarrow\ \hat V_{kS}^{-1/2}A_{kS}^{-1/2}(-\Delta_kE_x(S))^{1/2} &\leq \hat V_{jS}^{-1/2}A_{jS}^{-1/2}(-\Delta_jE_x(S))^{1/2} + \Big(\frac{1+c_\tau}{2}\Big)(\tau_{kS}+\tau_{jS})t_\alpha\\
&= \hat V_{jS}^{-1/2}A_{jS}^{-1/2}(-\Delta_jE_x(S))^{1/2} + \Big(\frac{1+c_\tau}{2}\Big)(\tau_{kS}+\tau_{jS})t_\alpha\left(\frac{A_{jS}\hat V_{jS}\big(\frac{1-c_\tau}{2}\big)^2\tau_{jS}^2t_\alpha^2}{A_{jS}\hat V_{jS}\big(\frac{1-c_\tau}{2}\big)^2\tau_{jS}^2t_\alpha^2}\right)^{1/2}.
\end{align*}
Using the fact that $-\Delta_jE_x(S) \geq A_{jS}\hat V_{jS}\big(\frac{1-c_\tau}{2}\big)^2\tau_{jS}^2t_\alpha^2$ (because $T_{jS\alpha} = 1$), the previous expression is bounded by
\begin{align*}
&\leq \hat V_{jS}^{-1/2}A_{jS}^{-1/2}(-\Delta_jE_x(S))^{1/2} + \frac{\big(\frac{1+c_\tau}{2}\big)(\tau_{kS}+\tau_{jS})t_\alpha}{\big(A_{jS}\hat V_{jS}\big(\frac{1-c_\tau}{2}\big)^2\tau_{jS}^2t_\alpha^2\big)^{1/2}}(-\Delta_jE_x(S))^{1/2}\\
&= \hat V_{jS}^{-1/2}A_{jS}^{-1/2}\Big(1 + \frac{1+c_\tau}{c_\tau-1}\cdot\frac{\tau_{kS}+\tau_{jS}}{\tau_{jS}}\Big)(-\Delta_jE_x(S))^{1/2}.
\end{align*}
This gives
\[
-\Delta_kE_x(S) \leq \frac{\hat V_{kS}A_{kS}}{\hat V_{jS}A_{jS}}\Big(1 + \frac{1+c_\tau}{c_\tau-1}\Big(1+\frac{\tau_{kS}}{\tau_{jS}}\Big)\Big)^2(-\Delta_jE_x(S)).
\]
S.5.2. Proof of Lemma 13
Note that
\begin{align*}
Z_{jS} &= \hat V_{jS}^{-1/2}([\hat\theta_{jS}]_j - [\theta^{*|x}_{jS}]_j)\\
&= \hat V_{jS}^{-1/2}(x_j'Q_Sx_j)^{-1}x_j'Q_S(\varepsilon - E[\varepsilon|x])\\
&= \big((x_j'Q_Sx_j)^{-1}E_n[\hat\varepsilon_{ijS}^2[Q_Sx_{jS}]_i^2](x_j'Q_Sx_j)^{-1}\big)^{-1/2}(x_j'Q_Sx_j)^{-1}x_j'Q_S(\varepsilon - E[\varepsilon|x])\\
&= E_n[\hat\varepsilon_{ijS}^2[Q_Sx_{jS}]_i^2]^{-1/2}x_j'Q_S(\varepsilon - E[\varepsilon|x])\\
&= E_n[\hat\varepsilon_{ijS}^2(\eta_{jS}'x_{ijS})^2]^{-1/2}\eta_{jS}'x_{jS}'(\varepsilon - E[\varepsilon|x])\\
&= E_n[\hat\varepsilon_{ijS}^2(\eta_{jS}'x_{ijS})^2]^{-1/2}\eta_{jS}'x_{jS}'(\varepsilon^o + \varepsilon^a - E[\varepsilon^a|x]).
\end{align*}
Let $\bar\varepsilon = \varepsilon^o + \varepsilon^a - E[\varepsilon^a|x]$. Define the Regularization Event by
\[
\mathcal R = \left\{\frac{|\sum_{i=1}^nx_{ik}\bar\varepsilon_i|}{\sqrt{\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2}} \leq t_\alpha \ \text{ for every } k \leq p\right\}.
\]
In addition, define the Variability Domination Event by
\[
\mathcal V = \left\{\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2 \leq \Big(\frac{1+c_\tau}{2}\Big)^2\sum_{i=1}^nx_{ik}^2\hat\varepsilon_{ijS}^2 \ \text{ for every } k \in jS, \text{ for every } |S| < K_n\right\}.
\]
The definitions of the Regularization Event and the Variability Domination Event are useful because
\[
\mathcal R \cap \mathcal V \Rightarrow \mathcal A.
\]
To see this, note that on $\mathcal R$, the following inequality holds for any conformable vector $\nu$:
\[
\Big(\sum_{i=1}^n\sum_{k\in jS}\nu_kx_{ik}\bar\varepsilon_i\Big)^2 \leq \Big(t_\alpha\sum_{k\in jS}|\nu_k|\sqrt{\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2}\Big)^2.
\]
Furthermore, on $\mathcal V$, the previous expression can be further bounded by
\begin{align*}
&\leq \Big(\frac{1+c_\tau}{2}\Big)^2\Big(t_\alpha\sum_{k\in jS}|\nu_k|\sqrt{\sum_{i=1}^nx_{ik}^2\hat\varepsilon_{ijS}^2}\Big)^2\\
&= \Big(\frac{1+c_\tau}{2}\Big)^2\frac{\Big(t_\alpha\sum_{k\in jS}|\nu_k|\sqrt{\sum_{i=1}^nx_{ik}^2\hat\varepsilon_{ijS}^2}\Big)^2}{\sum_{i=1}^n\big(\sum_{k\in jS}\nu_kx_{ik}\big)^2\hat\varepsilon_{ijS}^2}\sum_{i=1}^n\Big(\sum_{k\in jS}\nu_kx_{ik}\Big)^2\hat\varepsilon_{ijS}^2\\
&= \Big(\frac{1+c_\tau}{2}\Big)^2\frac{t_\alpha^2\|\nu'\mathrm{Diag}(\Psi^{\hat\varepsilon}_{jS})^{1/2}\|_1^2}{\nu'\Psi^{\hat\varepsilon}_{jS}\nu}\sum_{i=1}^n\Big(\sum_{k\in jS}\nu_kx_{ik}\Big)^2\hat\varepsilon_{ijS}^2.
\end{align*}
Specializing to the case $\nu = \eta_{jS}$, and using $\tau_{jS} = \frac{\|\nu'\mathrm{Diag}(\Psi^{\hat\varepsilon}_{jS})^{1/2}\|_1}{\sqrt{\nu'\Psi^{\hat\varepsilon}_{jS}\nu}}$, gives that
\[
|Z_{jS}| \leq \frac{1+c_\tau}{2}\tau_{jS}t_\alpha \quad \text{on } \mathcal R\cap\mathcal V.
\]
It is therefore sufficient to prove that $\mathcal R$ and $\mathcal V$ each have probability $\to 1$ under $P_{\mathcal X}$. $P_{\mathcal X}(\mathcal R) \to 1$ follows immediately from the moderate deviation bounds for self-normalized sums given in Jing et al. (2003). For details on the application of this result, see Belloni et al. (2012).
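The content of the Regularization Event can be illustrated by simulation: with $t_\alpha = \Phi^{-1}(1-\alpha/p)$, the self-normalized statistics stay below $t_\alpha$ simultaneously across all $p$ covariates with probability close to one (roughly at least $1-2\alpha$ by a two-sided union bound). The Gaussian design, error law, and simulation sizes below are arbitrary illustrative choices, not the paper's setting.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n, p, alpha = 200, 50, 0.05
t_alpha = NormalDist().inv_cdf(1 - alpha / p)   # t_alpha = Phi^{-1}(1 - alpha/p)

hits, n_rep = 0, 2000
for _ in range(n_rep):
    x = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)                # illustrative homoskedastic Gaussian errors
    num = np.abs(x.T @ eps)                     # |sum_i x_ik eps_i| for each k
    den = np.sqrt((x**2 * eps[:, None] ** 2).sum(axis=0))
    hits += np.all(num / den <= t_alpha)

# Two-sided union bound over p covariates: failure probability is roughly 2*alpha.
assert hits / n_rep >= 1 - 2 * alpha - 0.03
```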
Therefore, it is only left to show that $P_{\mathcal X}(\mathcal V) \to 1$. Define $\varepsilon_{ijS} = y_i - x_{ijS}'\theta^{*|x}_{jS}$. Furthermore, define $\xi_{ijS}$ through the decomposition $\varepsilon_{ijS} = \bar\varepsilon_i + \xi_{ijS}$. Let $\varepsilon_{jS}$ and $\xi_{jS}$ be the respective stacked versions. Let $\bar c_\tau = ((1+c_\tau)/2)^2$. Then
\begin{align*}
\bar c_\tau\sum_{i=1}^nx_{ik}^2\hat\varepsilon_{ijS}^2 &= \bar c_\tau\Big[\sum_{i=1}^nx_{ik}^2(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2) + \sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2 + 2\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i\xi_{ijS} + \sum_{i=1}^nx_{ik}^2\xi_{ijS}^2\Big]\\
&\geq \bar c_\tau\Big[\sum_{i=1}^nx_{ik}^2(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2) + \sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2 + 2\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i\xi_{ijS}\Big]\\
&= \sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2 + \bar c_\tau\sum_{i=1}^nx_{ik}^2(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2) + \frac{(\bar c_\tau - 1)}{2}\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2\\
&\qquad + 2\bar c_\tau\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i\xi_{ijS} + \frac{(\bar c_\tau - 1)}{2}\sum_{i=1}^nx_{ik}^2\bar\varepsilon_i^2.
\end{align*}
Define the two events
\[
\mathcal V' = \Big\{\bar c_\tau E_n[x_{ik}^2(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2)] + \frac{(\bar c_\tau - 1)}{2}E_n[x_{ik}^2\bar\varepsilon_i^2] \geq 0 \ \text{ for all } j, k \leq p,\ |S| < K_n\Big\},
\]
\[
\mathcal V'' = \Big\{2\bar c_\tau E_n[x_{ik}^2\bar\varepsilon_i\xi_{ijS}] + \frac{(\bar c_\tau - 1)}{2}E_n[x_{ik}^2\bar\varepsilon_i^2] \geq 0 \ \text{ for all } j, k \leq p,\ |S| < K_n\Big\}.
\]
Therefore $\mathcal V' \cap \mathcal V'' \Rightarrow \mathcal V$.
Note that $E_n[x_{ik}^2\bar\varepsilon_i^2] \geq \frac{1}{2}E_n[x_{ik}^2\varepsilon_i^2] - E_n[x_{ik}^2E[\varepsilon_i^a|x]^2] \geq \frac{1}{2}E_n[x_{ik}^2\varepsilon_i^2] - \max_{i\leq n}E[\varepsilon_i^{a2}|x]\,E_n[x_{ik}^4]^{1/2}$. This is bounded below with $P_{\mathcal X} \to 1$ by a positive constant independent of $n$. Therefore, to show that $P_{\mathcal X}(\mathcal V') \to 1$ and $P_{\mathcal X}(\mathcal V'') \to 1$, it suffices to show that $E_n[x_{ik}^2(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2)]$ and $E_n[x_{ik}^2\bar\varepsilon_i\xi_{ijS}]$, respectively, are of suitably smaller order.
First consider $E_n[x_{ik}^2(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2)]$. It is convenient to bound the slightly more general sum $E_n[x_{ik}x_{il}(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2)]$, because this will show up again:
\begin{align*}
E_n[x_{ik}x_{il}(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2)] &= 2E_n\big[x_{ik}x_{il}\varepsilon_{ijS}x_{ijS}'(\theta^{*|x}_{jS} - \hat\theta_{jS})\big] + E_n\big[x_{ik}x_{il}(x_{ijS}'(\theta^{*|x}_{jS} - \hat\theta_{jS}))^2\big]\\
&\leq 2\|E_n[x_{ik}x_{il}\varepsilon_{ijS}x_{ijS}']\|_2\|\theta^{*|x}_{jS} - \hat\theta_{jS}\|_2 + \lambda_{\max}E_n[x_{ik}x_{il}x_{ijS}x_{ijS}']\|\theta^{*|x}_{jS} - \hat\theta_{jS}\|_2^2.
\end{align*}
Standard reasoning gives that $\|\theta^{*|x}_{jS} - \hat\theta_{jS}\|_2 \leq \varphi_{\min}(K_n)(G)^{-1/2}\sqrt{K_n}\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty$. Therefore, the bound continues:
\begin{align*}
&\leq 2\|E_n[x_{ik}x_{il}\varepsilon_{ijS}x_{ijS}']\|_2\varphi_{\min}(K_n)(G)^{-1/2}\sqrt{K_n}\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty\\
&\quad + \lambda_{\max}E_n[x_{ik}x_{il}x_{ijS}x_{ijS}']\varphi_{\min}(K_n)(G)^{-1}K_n\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty^2.
\end{align*}
Note that $\lambda_{\max}E_n[x_{ik}x_{il}x_{ijS}x_{ijS}'] \leq K_n\max_{j\leq p}E_n[x_{ij}^4]$, so that
\begin{align*}
&\leq 2\|E_n[x_{ik}x_{il}\varepsilon_{ijS}x_{ijS}']\|_2\varphi_{\min}(K_n)(G)^{-1/2}\sqrt{K_n}\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty\\
&\quad + K_n^2\max_{j\leq p}E_n[x_{ij}^4]\varphi_{\min}(K_n)(G)^{-1}\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty^2.
\end{align*}
An application of the Cauchy–Schwarz inequality to the first term gives
\begin{align*}
&\leq 2\sqrt{K_n}\max_{j}E_n[x_{ij}^4]^{1/2}\max_{j,S}E_n[\varepsilon_{ijS}^2x_{ij}^2]^{1/2}\varphi_{\min}(K_n)(G)^{-1/2}\sqrt{K_n}\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty\\
&\quad + K_n^2\max_{j\leq p}E_n[x_{ij}^4]\varphi_{\min}(K_n)(G)^{-1}\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty^2.
\end{align*}
Next, $\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty$ and $E_n[\varepsilon_{ijS}^2x_{ij}^2]^{1/2}$ are bounded using $\varepsilon_{ijS} = \varepsilon_i - E[\varepsilon_i^a|x] + \xi_{ijS}$. Note that by construction $\|E_n[x_{ijS}\xi_{ijS}]\|_\infty = 0$. Then
\begin{align*}
\|E_n[x_{ijS}\varepsilon_{ijS}]\|_\infty &\leq \|E_n[x_i\varepsilon_i]\|_\infty + \|E_n[x_iE[\varepsilon_i^a|x]]\|_\infty\\
&\leq \|E_n[x_i\varepsilon_i]\|_\infty + \max_{j\leq p}E_n[x_{ij}^2]^{1/2}E_n[E[\varepsilon_i^a|x]^2]^{1/2} = O(\sqrt{\log p/n})
\end{align*}
with $P_{\mathcal X} \to 1$. Next,
\begin{align*}
E_n[\varepsilon_{ijS}^2x_{ij}^2] &\leq 3E_n[\varepsilon_i^2x_{ij}^2] + 3E_n[E[\varepsilon_i^{a2}|x]x_{ij}^2] + 3E_n[\xi_{ijS}^2x_{ij}^2]\\
&\leq 3E_n[\varepsilon_i^2x_{ij}^2] + 3E_n[x_{ij}^2]\max_{i\leq n}E[\varepsilon_i^{a2}|x] + 3E_n[\xi_{ijS}^4]^{1/2}E_n[x_{ij}^4]^{1/2}.
\end{align*}
Next, $(E_n[\xi_{ijS}^4])^{1/2} \leq O(1)s_0^2$ on $\mathcal X_1\cap\mathcal X_3$. To see this, note $\xi_{jS} = Q_{jS}x\theta_0 = \sum_{l=1}^{s_0}Q_{jS}x_l\theta_{0,l} = \sum_{l=1}^{s_0}\eta_{l,(jS)}x_{l\,jS} = \eta_{jS}x_{S_0\cup jS}$ for some new linear combination $\eta_{jS}$. Note that $\|\eta_{jS}\|_1 \leq s_0O(1)$. Then $(E_n[\xi_{ijS}^4])^{1/4} \leq \|\eta_{jS}\|_1\max_{k\leq p}E_n[x_{ik}^4]^{1/4}$, from which the bound follows.
Next consider $E_n[x_{ik}^2\bar\varepsilon_i\xi_{ijS}]$. Consider two cases. In Case 1,
\[
E_n[x_{ik}^4\xi_{ijS}^2]^{1/2} \leq \frac{1}{2\bar c_\tau}\frac{(\bar c_\tau - 1)}{2}\frac{E_n[x_{ik}^2\bar\varepsilon_i^2]}{E_n[\bar\varepsilon_i^2]^{1/2}}.
\]
In this case, $2\bar c_\tau|E_n[x_{ik}^2\bar\varepsilon_i\xi_{ijS}]| \leq 2\bar c_\tau E_n[x_{ik}^4\xi_{ijS}^2]^{1/2}E_n[\bar\varepsilon_i^2]^{1/2} \leq \frac{\bar c_\tau - 1}{2}E_n[x_{ik}^2\bar\varepsilon_i^2]$, and the requirement of $\mathcal V''$ for $k, j, S$ holds.

For Case 2, suppose instead that $E_n[x_{ik}^4\xi_{ijS}^2]^{1/2} > \frac{1}{2\bar c_\tau}\frac{(\bar c_\tau - 1)}{2}\frac{E_n[x_{ik}^2\bar\varepsilon_i^2]}{E_n[\bar\varepsilon_i^2]^{1/2}}$ holds. Then $E[E_n[x_{ik}^4\xi_{ijS}^2\bar\varepsilon_i^2]\,|\,x]$ is bounded away from zero by the conditions on $E[\varepsilon_i^2|x]$ and $\max_i|\varepsilon_i^a|$. In addition, $E[E_n[|x_{ik}|^6|\xi_{ijS}|^3|\bar\varepsilon_i|^3]\,|\,x] \leq \max_iE[|\bar\varepsilon_i|^3|x]E_n[|x_{ik}|^6|\xi_{ijS}|^3] \leq O(1)E_n[|x_{ik}|^6|\xi_{ijS}|^3]$. This term is further bounded by $O(1)E_n[x_{ik}^{12}]^{1/2}E_n[|\xi_{ijS}|^6]^{1/2}$. Using the same reasoning as in bounding $E_n[\xi_{ijS}^4]$ earlier, it follows that $E_n[|\xi_{ijS}|^6]^{1/2} = O(1)s_0^3$. In addition, $E_n[x_{ik}^{12}] = O(1)$. As a result, for those $k, j, S$ which fall in Case 2, the self-normalized sum
\[
\max_{j,k,S\,\in\,\text{Case 2}}\frac{\sqrt n\,|E_n[x_{ik}^2\xi_{ijS}\bar\varepsilon_i]|}{\sqrt{E_n[x_{ik}^4\xi_{ijS}^2\bar\varepsilon_i^2]}}
\]
is $O(\log(pK_n))$ with probability $1-o(1)$ provided $\sqrt{\log(pK_n)} = o(n^{1/6}/(s_0^3)^{1/3})$. This holds under the assumed rate conditions. Then $\max_{j,k,S}|E_n[x_{ik}^2\xi_{ijS}\bar\varepsilon_i]|$ is bounded by $\frac{1}{\sqrt n}O(\log(pK_n))\max_{j,k,S}\sqrt{E_n[x_{ik}^4\xi_{ijS}^2\bar\varepsilon_i^2]}$. Furthermore, $E_n[x_{ik}^4\xi_{ijS}^2\bar\varepsilon_i^2] \leq E_n[x_{ik}^8\xi_{ijS}^4]^{1/2}E_n[\bar\varepsilon_i^4]^{1/2} \leq (E_n[x_{ik}^{12}]^{2/3}E_n[\xi_{ijS}^{12}]^{1/3})^{1/2}E_n[\bar\varepsilon_i^4]^{1/2} \leq O(1)s_0^2E_n[\bar\varepsilon_i^4]^{1/2}$. Note that $E_n[\bar\varepsilon_i^4]^{1/2} \leq O(1)$ with $P_{\mathcal X} \to 1$. Together, these give that $\max_{j,k,S}|E_n[x_{ik}^2\bar\varepsilon_i\xi_{ijS}]| = o(1)$ with $P_{\mathcal X} \to 1$. Finally, $P_{\mathcal X}(\mathcal V) \to 1$.
S.5.3. Proof of Lemma 14
First, the $A_{jS}$ depend only on $x$ and are bounded above and below on $\mathcal X$ by constants which do not depend on $n$, by the assumption on the sparse eigenvalues of $G$. For bounding $\tau_{jS}$ above and away from zero, since $1 \leq \|\eta_{jS}\|_1$ and $\|\eta_{jS}\|_2 \leq O(1)$ on $\mathcal X$, it is sufficient to show that the eigenvalues of $\Psi^{\hat\varepsilon}_{jS} = E_n[x_{ijS}x_{ijS}'\hat\varepsilon_{ijS}^2]$ remain bounded above and away from zero and that the diagonal terms of $\Psi^{\hat\varepsilon}_{jS}$ remain bounded above and away from zero. Note that by the arguments of the last section, $E_n[x_{ik}x_{il}(\hat\varepsilon_{ijS}^2 - \varepsilon_{ijS}^2)] = O(\sqrt{\log p/n})$ with $P_{\mathcal X} \to 1$. Therefore, $\|E_n[x_{ijS}x_{ijS}'\hat\varepsilon_{ijS}^2] - E_n[x_{ijS}x_{ijS}'\varepsilon_{ijS}^2]\|_F = O(K_n\sqrt{\log p/n})$ with $P_{\mathcal X} \to 1$, where $\|\cdot\|_F$ denotes the Frobenius norm. By the assumed rate condition, the above quantity therefore vanishes with $P_{\mathcal X} \to 1$.

Next,
\begin{align*}
E_n[x_{ijS}x_{ijS}'\varepsilon_{ijS}^2] &= E_n[x_{ijS}x_{ijS}'\varepsilon_i^2] + 2E_n[x_{ijS}x_{ijS}'\varepsilon_i(\xi_{ijS} - E[\varepsilon_i^a|x])]\\
&\quad + E_n[x_{ijS}x_{ijS}'(\xi_{ijS} - E[\varepsilon_i^a|x])^2].
\end{align*}
The first term above, $E_n[x_{ijS}x_{ijS}'\varepsilon_i^2]$, has eigenvalues bounded away from zero for all $j, S$ with $P_{\mathcal X} \to 1$. The third term above, $E_n[x_{ijS}x_{ijS}'(\xi_{ijS} - E[\varepsilon_i^a|x])^2]$, is positive semidefinite by construction. The second term above has Frobenius norm tending to zero for all $j, S$ with $P_{\mathcal X} \to 1$. This, in conjunction with the fact that the eigenvalues of $E_n[x_{ijS}x_{ijS}'\varepsilon_{ijS}^2]$ are bounded above and away from zero with $P_{\mathcal X} \to 1$, shows that the eigenvalues of $\Psi^{\hat\varepsilon}_{jS} = E_n[x_{ijS}x_{ijS}'\hat\varepsilon_{ijS}^2]$ are bounded above and away from zero with $P_{\mathcal X} \to 1$. Finally, for bounding $\hat V_{jS}$, it is sufficient to show that $\max_{k\leq p}E_n[\varepsilon_i^2(\eta_{jS}'x_{ijS})^2]$ is bounded above. This follows immediately from $E[\varepsilon_i^4|x]$ being uniformly bounded, $\max_{j,S}\|\eta_{jS}\|_1 = O(1)$, and $\max_{k\leq p}E_n[x_{ik}^4] = O(1)$. These imply that $P_{\mathcal X}(\mathcal B) \to 1$.
References
A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal
instruments with an application to eminent domain. Econometrica, 80:2369–2429, 2012.
B. Y. Jing, Q. M. Shao, and Q. Wang. Self-normalized Cramér-type large deviations for independent random variables. Annals of Probability, 31(4):2167–2215, 2003.