arXiv:2108.07940v2 [stat.ME] 3 Nov 2021
WEAK SIGNAL IDENTIFICATION AND INFERENCE

IN PENALIZED LIKELIHOOD MODELS

FOR CATEGORICAL RESPONSES

1 Yuexia Zhang, 2 Peibei Shi, 3 Zhongyi Zhu, 4 Linbo Wang and 5 Annie Qu

1,4 University of Toronto, 2 Facebook, 3 Fudan University, 5 University of California, Irvine

Abstract:

Penalized likelihood models are widely used to simultaneously select variables and estimate model

parameters. However, the existence of weak signals can lead to inaccurate variable selection, biased

parameter estimation, and invalid inference. Thus, identifying weak signals accurately and making

valid inferences are crucial in penalized likelihood models. We develop a unified approach to identify

weak signals and make inferences in penalized likelihood models, including the special case when the

responses are categorical. To identify weak signals, we use the estimated selection probability of

each covariate as a measure of the signal strength and formulate a signal identification criterion. To

construct confidence intervals, we propose a two-step inference procedure. Extensive simulation studies

show that the proposed procedure outperforms several existing methods. We illustrate the proposed

method by applying it to the Practice Fusion diabetes data set.

Key words and phrases: adaptive lasso, de-biased method, model selection, post-selection inference

1. Introduction

In the big data era, massive data are collected with large-dimensional covariates. However, only some of the covariates might be important. To select the important variables

and estimate their effects on the response variable, various penalized likelihood models

have been proposed, such as the penalized least squares regression model (Tibshirani, 1996;

Zou and Hastie, 2005; Tibshirani et al., 2005; Yuan and Lin, 2006; Zou, 2006; Zhang, 2010),


penalized logistic regression model (Park and Hastie, 2008; Zhu and Hastie, 2004; Wu et al.,

2009), and penalized Poisson regression model (Lambert and Eilers, 2005; Jia et al., 2019).

To achieve model selection consistency or the variable screening property for a high-

dimensional problem, a common condition is the “beta-min” condition, which requires the

nonzero regression coefficients to be sufficiently large (Zhao and Yu, 2006; Huang and Xie,

2007; Van de Geer et al., 2011; Tibshirani, 2011; Zhang and Jia, 2017). Therefore, classical

methods for variable selection often focus on strong signals that satisfy such a condition.

However, if the “beta-min” condition is violated, the important variables and unimportant

variables may be inseparable, and the true important variables might not be selected, even

if the sample size goes to infinity (Zhang, 2013). In finite samples, the estimators shrink the

true regression coefficients, owing to the penalty function. When a signal is weak, its coefficient is more likely to be shrunk to zero (Shi and Qu, 2017; Liu et al., 2020). Inaccurate variable selection and biased parameter estimation could lead to poor post-selection inference; for example, the estimation of the confidence intervals could be inaccurate. Thus, both

strong and weak signals need to be considered. Identification and inference for weak signals

can also help discover potentially important variables in practice. For example, in genome-

wide association studies (GWAS), overlooked risk factors for a disease may be recovered by

incorporating weak signals (Liu et al., 2020).

For linear regression models, several studies have investigated weak signals. In more extreme

cases, Jin et al. (2014) assumed all signals were individually weak and proposed graphlet

screening for variable selection. Zhang (2017) proposed the perturbed lasso, where signals

were strengthened by adding random perturbations to the design matrix. However, these

methods focused only on variable selection consistency, and did not aim to identify weak signals or provide statistical inference. For weak signal identification and inference, Shi and Qu

(2017) proposed a weak signal identification procedure in finite samples, and introduced a

two-step inference method for constructing confidence intervals after signal identification.

However, their derivation relies on a crucial assumption that the design matrix is orthogonal, which may not hold in practice. On the other hand, Li et al. (2019) took advantage of

the correlations between covariates, detecting weak signals through the partial correlations

between strong and weak signals. However, they did not study weak signal inference. Recently, Liu et al. (2020) proposed a method that combines the bootstrap lasso and a partial

ridge regression for constructing confidence intervals when there are weak signals in the covariates. However, as stated in their paper, the confidence intervals of the coefficients, with

magnitudes of order 1/√n, may be invalid.

To the best of our knowledge, there has been little work on weak signals in likelihood-

based models for categorical responses. One exception is Reangsephet et al. (2020), who

proposed variable selection methods for logistic regression models with weak signals. However, they did not conduct weak signal identification or inference.

We address these gaps by developing a new unified approach to weak signal identification

and inference in penalized likelihood models, including the special case when the responses

are categorical. Specifically, the estimated probability of each covariate being selected by

the one-step adaptive lasso estimator is used to measure the signal strength. After signal

identification, a two-step inference procedure is proposed for constructing the confidence

intervals for the regression coefficients. The proposed method has several advantages. First,

we extend the method of Shi and Qu (2017) from linear regression models to likelihood-

based models, including generalized linear models. However, our extension is not trivial.


For example, in Shi and Qu (2017), the selection probability has an explicit expression.

For the proposed likelihood-based method, such an explicit expression does not exist for

categorical responses. Thus, we propose a new method to estimate the selection probability.

Second, in Shi and Qu (2017), the selection probability for the covariate Xj is an increasing

function of |βj0|, where βj0 is the corresponding coefficient of Xj . Under our current general

framework, such a conclusion is not necessarily true. Thus, our signal identification criterion

is based directly on the estimated selection probability, in contrast to Shi and Qu (2017).

We also discuss how each signal’s selection probability is influenced by other covariates,

owing to nonlinear modeling or collinearity among the covariates; in Shi and Qu (2017),

the selection probability of one covariate is independent of those of other covariates. Third,

Shi and Qu (2017) assumed that the design matrix in a linear regression model is orthogonal,

whereas the proposed method relaxes this constraint. Fourth, the proposed inference method

differs from that of Shi and Qu (2017). Specifically, we construct confidence intervals for

the noise variables as well, whereas their method does not. Simulation results show that

our proposed two-step inference method outperforms the two-step inference method based

on Shi and Qu (2017). In particular, the proposed confidence intervals achieve accurate

coverage probabilities for all signal strength levels.

The remainder of this paper is organized as follows. In Section 2, we introduce the one-

step adaptive lasso estimator and derive the variable selection condition. In Section 3, we

propose the weak signal identification criterion. In Section 4, we develop a two-step inference

procedure for constructing confidence intervals. In Section 5, we conduct simulation studies

to assess the finite-sample performance of the proposed method. In Section 6, we apply the

proposed method to an analysis of diabetes data. In Section 7, we provide brief concluding remarks. We provide the technical proofs, implementation details of several methods, and

some additional results in the Supplementary Material.

2. One-step adaptive lasso estimator and variable selection condition

In this section, we introduce the one-step penalized likelihood estimator and derive the

condition for variable selection, which we use later for weak signal identification and inference.

Let (x_1^⊤, y_1)^⊤, ..., (x_n^⊤, y_n)^⊤ be n independent and identically distributed (i.i.d.) random vectors, where x_i = (x_{i1}, ..., x_{ip})^⊤ is a p × 1 vector of predictors and y_i is a response variable. Assume that y_i depends on x_i through a linear combination x_i^⊤β_0, and the conditional log-likelihood of y_i given x_i is ℓ_i(γ_0) = ℓ_i(α_0 + x_i^⊤β_0, y_i), where γ_0 = (α_0, β_0^⊤)^⊤, α_0 is an unknown true location parameter, and β_0 = (β_{10}, ..., β_{p0})^⊤ is an unknown p × 1 vector of covariate effects. Note that for a likelihood-based model, it is not always possible to eliminate the location parameter by centering the covariates and the response variable. For simplicity, assume p < n and p is fixed. Let ℓ(γ) = Σ_{i=1}^{n} ℓ_i(γ) denote the log-likelihood. Assume γ^{(0)} is the maximum likelihood estimator of γ_0; then γ^{(0)} = (α^{(0)}, β^{(0)⊤})^⊤ = argmax_γ ℓ(γ). In matrix notation, we set X = (x_1, ..., x_n)^⊤ = (X_1, ..., X_p), with X_j = (x_{1j}, ..., x_{nj})^⊤ and Y = (y_1, ..., y_n)^⊤. Furthermore, denote x̃_i = (1, x_i^⊤)^⊤ and X̃ = (1, X), where 1 is an n × 1 vector with all elements equal to one. Throughout this paper, we assume that E(x_{ij}) = 0 and Var(x_{ij}) = 1, for all i ∈ {1, ..., n} and j ∈ {1, ..., p}, which can be realized by standardizing the covariate matrix X in practice.

Assume that some components of β0 are zero. In order to estimate the model parameters

and select important variables simultaneously, we consider the penalized likelihood function ℓ(γ)/n − Σ_{j=1}^{p} p_{λ_j}(|β_j|), where p_{λ_j}(·) is a penalty function controlled by the tuning parameter λ_j. One popular penalty function is derived from the adaptive lasso estimator (Zou, 2006), where p_{λ_j}(|β_j|) = λ|β_j|/|β_j^{(0)}|. Maximizing the penalized likelihood function is equivalent to minimizing

\[
-\frac{1}{n}\ell(\gamma) + \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|) \tag{2.1}
\]

with respect to γ. According to Wang and Leng (2007) and Zou and Li (2008), if the log-likelihood function has first and second derivatives, then it can be approximated by a Taylor expansion. Furthermore, the objective function (2.1) can be approximated by

\[
Q_1(\gamma) = -\frac{1}{2n}(\gamma - \gamma^{(0)})^\top \ddot{\ell}(\gamma^{(0)})(\gamma - \gamma^{(0)}) + \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \tag{2.2}
\]

where \ddot{\ell}(·) is the second derivative of the function ℓ(·). The one-step penalized likelihood estimator is γ^{(1)} = (α^{(1)}, β^{(1)⊤})^⊤ = argmin_γ Q_1(γ).

Denote µ_i(γ) = µ_i = x̃_i^⊤γ and ℓ_i{µ_i(γ)} = ℓ_i(x̃_i^⊤γ, y_i). Let D(γ) be an n × n diagonal matrix with (i, i)th element D_{ii}(γ) = −∂²ℓ_i{µ_i(γ)}/∂µ_i², for i = 1, ..., n. Then \ddot{\ell}(γ) = −X̃^⊤D(γ)X̃. Furthermore, we assume D_{ii}(γ) is a continuous function of γ. For simplicity, denote D(γ^{(0)}), D(γ_0), D_{ii}(γ^{(0)}), and D_{ii}(γ_0) as D^{(0)}, D_0, D^{(0)}_{ii}, and D_{0,ii}, respectively. By solving the equation ∂Q_1(γ)/∂α = 0, we obtain

\[
\alpha - \alpha^{(0)} = (\mathbf{1}^\top D^{(0)}\mathbf{1})^{-1}\mathbf{1}^\top D^{(0)} X(\beta^{(0)} - \beta). \tag{2.3}
\]


Replacing α − α^{(0)} by (2.3) in (2.2), we obtain the following objective function Q_2(β):

\[
\begin{aligned}
Q_2(\beta) &= \frac{1}{2n}(\beta-\beta^{(0)})^\top X^\top D^{\dagger(0)} X(\beta-\beta^{(0)}) + \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|)\\
&= \frac{1}{2n}(\beta-\beta^{(0)})^\top X^\top D^{\star(0)\top} D^{\star(0)} X(\beta-\beta^{(0)}) + \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|),
\end{aligned}
\tag{2.4}
\]

where D^{†(0)} = D^{(0)} − D^{(0)}1(1^⊤D^{(0)}1)^{-1}1^⊤D^{(0)} and D^{⋆(0)} = (D^{(0)})^{1/2} − (D^{(0)})^{1/2}1(1^⊤D^{(0)}1)^{-1}1^⊤D^{(0)}. Denote D^†_0 = D_0 − D_0 1(1^⊤D_0 1)^{-1}1^⊤D_0 and D^⋆_0 = D_0^{1/2} − D_0^{1/2}1(1^⊤D_0 1)^{-1}1^⊤D_0, correspondingly.

We focus mainly on weak signal identification using the one-step adaptive lasso estimator.

However, our method can be extended to other penalized likelihood estimators. Following the

idea of Zou and Li (2008), the algorithm for computing the one-step adaptive lasso estimator

γ(1) is as follows:

Step 1. Create the working data X^⋆ = D^{⋆(0)}XW and Y^⋆ = D^{⋆(0)}Xβ^{(0)}, where W = diag{|β_1^{(0)}|, ..., |β_p^{(0)}|}.

Step 2. Apply the coordinate descent algorithm to solve

\[
\beta^\star = \arg\min_{\beta}\left\{ \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i^\star - \sum_{j=1}^{p} x_{ij}^\star \beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \tag{2.5}
\]

where β^⋆ = (β_1^⋆, ..., β_p^⋆)^⊤, y_i^⋆ is the ith element of Y^⋆, and x_{ij}^⋆ is the (i, j)th element of X^⋆.

Step 3. Obtain β^{(1)} = (β_1^{(1)}, ..., β_p^{(1)})^⊤ using β_j^{(1)} = β_j^⋆ |β_j^{(0)}|, for j = 1, ..., p.

Step 4. Obtain α^{(1)} = (1^⊤D^{(0)}1)^{-1}1^⊤D^{(0)}X(β^{(0)} − β^{(1)}) + α^{(0)}.
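The steps above translate directly into code. The following R sketch is only an illustration under assumed inputs, not the authors' implementation: X is the n × p standardized design matrix, alpha0 and beta0 are the maximum likelihood estimates, d0 is the vector of diagonal weights D^(0)_ii evaluated at the MLE (for example, p̂_i(1 − p̂_i) in a logistic model), and lambda is the tuning parameter. The inner lasso step uses a plain coordinate descent loop rather than a production solver, and the function name one_step_alasso is hypothetical.

```r
# Minimal R sketch of the one-step adaptive lasso (Steps 1-4).
one_step_alasso <- function(X, alpha0, beta0, d0, lambda, n_iter = 200) {
  n <- nrow(X); p <- ncol(X)
  one <- rep(1, n)
  # D^{*(0)} = D^{1/2} - D^{1/2} 1 (1' D 1)^{-1} 1' D  (Section 2)
  Dhalf <- diag(sqrt(d0))
  Dstar <- Dhalf - Dhalf %*% one %*% t(one) %*% diag(d0) / sum(d0)
  W  <- diag(abs(beta0), p)
  Xs <- Dstar %*% X %*% W              # Step 1: working covariates X*
  Ys <- Dstar %*% X %*% beta0          # Step 1: working response Y*
  # Step 2: coordinate descent for the lasso problem (2.5)
  soft <- function(z, r) sign(z) * pmax(abs(z) - r, 0)
  bs <- rep(0, p)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r_j <- Ys - Xs[, -j, drop = FALSE] %*% bs[-j]
      bs[j] <- soft(sum(r_j * Xs[, j]), n * lambda) / sum(Xs[, j]^2)
    }
  }
  beta1 <- bs * abs(beta0)             # Step 3: back-transform to beta^(1)
  alpha1 <- sum(d0 * (X %*% (beta0 - beta1))) / sum(d0) + alpha0   # Step 4
  list(alpha1 = alpha1, beta1 = beta1)
}
```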


From the above algorithm, if β_j^⋆ ≠ 0, then the covariate X_j will be selected. According to (2.5), by using the coordinate descent algorithm, we obtain that

\[
\beta_j^\star = s\!\left( \frac{\sum_{i=1}^{n}\big(y_i^\star - \sum_{k\neq j} x_{ik}^\star \beta_k^\star\big)\, x_{ij}^\star}{\sum_{i=1}^{n}(x_{ij}^\star)^2},\; \frac{n\lambda}{\sum_{i=1}^{n}(x_{ij}^\star)^2} \right),
\]

where s(z, r) = sgn(z)(|z| − r)_+. Then, the condition for β_j^⋆ ≠ 0 (β_j^{(1)} ≠ 0) is

\[
\left| \frac{\sum_{i=1}^{n}\big(y_i^\star - \sum_{k\neq j} x_{ik}^\star \beta_k^\star\big)\, x_{ij}^\star}{\sum_{i=1}^{n}(x_{ij}^\star)^2} \right| > \frac{n\lambda}{\sum_{i=1}^{n}(x_{ij}^\star)^2}. \tag{2.6}
\]

For each i ∈ {1, ..., n} and s ∈ {1, ..., n}, let d^{(0)}_{is} be the (i, s)th element of D^{⋆(0)}. Then the variable selection condition (2.6) is equivalent to

\[
\left| \sum_{i=1}^{n}\Big(\sum_{s=1}^{n} d^{(0)}_{is} x_{sj}\Big)^2 (\beta_j^{(0)})^2 + \sum_{k\neq j}\sum_{i=1}^{n}\Big(\sum_{s=1}^{n} d^{(0)}_{is} x_{sk}\Big)\Big(\sum_{s=1}^{n} d^{(0)}_{is} x_{sj}\Big)\,\beta_j^{(0)}\big(\beta_k^{(0)} - \beta_k^{(1)}\big) \right| > n\lambda. \tag{2.7}
\]

Similarly to the proof in Zou and Li (2008), we obtain that if the tuning parameter λ

satisfies the conditions √n λ → 0 and nλ → ∞, then the one-step adaptive lasso estimator

enjoys model selection consistency, and the nonzero one-step adaptive lasso estimators have

the property of asymptotic normality.


3. Weak signal definition and identification

3.1 Weak signal definition

Suppose a model contains both strong and weak signals. Without loss of generality, assume the covariate matrix X consists of three components, that is, X = {X^{(S)}, X^{(W)}, X^{(N)}}, where X^{(S)}, X^{(W)}, and X^{(N)} represent the subsets of strong signals, weak signals, and noise variables, respectively. Following Shi and Qu (2017), we use the selection probability of each covariate to measure the signal strength. Specifically, for any penalized model selection estimator β = (β_1, ..., β_p)^⊤, we define P_{d,j} as the probability of selecting the covariate X_j, that is, P_{d,j} = P(β_j ≠ 0), j ∈ {1, ..., p}. For the one-step adaptive lasso estimator β^{(1)} = (β_1^{(1)}, ..., β_p^{(1)})^⊤, based on the variable selection condition (2.7), P_{d,j} does not have an explicit form. However, in the Supplementary Material S1, we show that P_{d,j} can be approximated by P^*_{d,j}, where

\[
P^*_{d,j} = \Phi\!\left( \frac{-\sqrt{\dfrac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) - E(D_{0,ii}x_{ij})^2}} + \beta_{j0}}{\sqrt{\{E(\tilde{X}^\top D_0 \tilde{X})^{-1}\}_{j+1,j+1}}} \right)
+ \Phi\!\left( \frac{-\sqrt{\dfrac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) - E(D_{0,ii}x_{ij})^2}} - \beta_{j0}}{\sqrt{\{E(\tilde{X}^\top D_0 \tilde{X})^{-1}\}_{j+1,j+1}}} \right). \tag{3.1}
\]

Intuitively, in the derivation of the selection probability, we can omit the terms of (S2) and

(S3) in the Supplementary Material S1, and simplify the calculation using asymptotic theory.

Then we can relax the orthogonality assumption required in Shi and Qu (2017). We require

the following mild assumption to ensure (3.1) is valid.

Assumption 1. For each i ∈ {1, ..., n} and j ∈ {1, ..., p}, P(D_{0,ii} > 0) = 1, E(D_{0,ii}) < ∞, E(D_{0,ii}x_{ij}^2) < ∞, and E(X̃^⊤D_0X̃) is positive definite.


The condition P(D_{0,ii} > 0) = 1 implies that the conditional log-likelihood function of y_i given x_i, ℓ_i{µ_i(γ)}, is a concave function of µ_i(γ). This is a necessary condition for the uniqueness of the maximum likelihood estimator γ^{(0)}. In addition, according to the Cauchy–Schwarz inequality, this also ensures that E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) − E(D_{0,ii}x_{ij})^2 > 0. The conditions E(D_{0,ii}) < ∞ and E(D_{0,ii}x_{ij}^2) < ∞ guarantee that all expectations of random variables in (3.1) are bounded for finite n. The positive-definiteness of E(X̃^⊤D_0X̃) is a necessary condition for the asymptotic normality of the maximum likelihood estimator γ^{(0)}, and ensures that {E(X̃^⊤D_0X̃)^{-1}}_{j+1,j+1} > 0.

For a deeper understanding of P^*_{d,j}, we first study its asymptotic properties. When β_{j0} = 0,

\[
P^*_{d,j} = 2\Phi\!\left( \frac{-\sqrt{n\lambda}}{\sqrt{\{E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) - E(D_{0,ii}x_{ij})^2\}/E(D_{0,ii})}\;\sqrt{[\{E(\tilde{X}^\top D_0 \tilde{X})/n\}^{-1}]_{j+1,j+1}}} \right).
\]

Under Assumption 1, {E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) − E(D_{0,ii}x_{ij})^2}/E(D_{0,ii}) and [{E(X̃^⊤D_0X̃)/n}^{-1}]_{j+1,j+1} are both positive and bounded. If nλ → ∞, then P^*_{d,j} → 0.

When β_{j0} ≠ 0,

\[
P^*_{d,j} = \Phi\!\left( \frac{-\sqrt{n}\left[\sqrt{\dfrac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) - E(D_{0,ii}x_{ij})^2}} - \beta_{j0}\right]}{\sqrt{[\{E(\tilde{X}^\top D_0\tilde{X})/n\}^{-1}]_{j+1,j+1}}} \right)
+ \Phi\!\left( \frac{-\sqrt{n}\left[\sqrt{\dfrac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^2)E(D_{0,ii}) - E(D_{0,ii}x_{ij})^2}} + \beta_{j0}\right]}{\sqrt{[\{E(\tilde{X}^\top D_0\tilde{X})/n\}^{-1}]_{j+1,j+1}}} \right).
\]

If √n λ → 0, then P^*_{d,j} → 1 under Assumption 1.

These asymptotic properties of P^*_{d,j} are consistent with the conclusion that the one-step adaptive lasso estimator enjoys model selection consistency if λ satisfies √n λ → 0 and nλ → ∞.

In the following, we study the finite-sample properties of P^*_{d,j}. To illustrate, we first consider three special cases, in which the likelihood-based model is a linear regression model, a logistic regression model, and a Poisson regression model, respectively.

Case One: Linear regression model

We first illustrate the simplest case under the linear regression model setting. Let y_i = α_0 + x_i^⊤β_0 + ε_i, where the ε_i are i.i.d. N(0, σ²); then D_{0,ii} = 1/σ². If we assume corr(x_{ij}, x_{ik}) = 0 for any k ≠ j, then

\[
P^*_{d,j} = \Phi\!\left(\frac{\beta_{j0} - \sqrt{\lambda}\,\sigma}{\sigma/\sqrt{n}}\right) + \Phi\!\left(\frac{-\beta_{j0} - \sqrt{\lambda}\,\sigma}{\sigma/\sqrt{n}}\right).
\]

Note that if the tuning parameter λ is replaced by λ_Shi = λσ², then P^*_{d,j} has the same form as that in Shi and Qu (2017), where the covariate matrix is assumed to be orthogonal. In this case, P^*_{d,j} does not depend on γ_0^{-j}, where γ_0^{-j} stands for the components of γ_0 other than β_{j0}. In addition, given all other quantities in P^*_{d,j} except β_{j0}, P^*_{d,j} is a symmetric function of β_{j0} and increases with |β_{j0}|. Thus, both P^*_{d,j} and |β_{j0}| can be used to measure the signal strength of X_j, as shown in Shi and Qu (2017).
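As a quick numerical check of this expression (a sketch, not code from the paper), the following R lines evaluate P^*_{d,j} in the uncorrelated linear case; the values of n, λ, and σ are those stated in the Figure 1 caption, and the function name p_star_linear is ours.

```r
# P*_{d,j} for the linear model with uncorrelated, standardized covariates:
# Phi((b - sqrt(lambda)*sigma)/(sigma/sqrt(n))) + Phi((-b - sqrt(lambda)*sigma)/(sigma/sqrt(n)))
p_star_linear <- function(b, n = 100, lambda = 0.2, sigma = 1) {
  se <- sigma / sqrt(n)
  pnorm((b - sqrt(lambda) * sigma) / se) + pnorm((-b - sqrt(lambda) * sigma) / se)
}
beta_grid <- seq(-1, 1, by = 0.05)
p_star_linear(beta_grid)   # symmetric in b and increasing in |b|
```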

However, if corr(x_{ij}, x_{ik}) ≠ 0 for some k ≠ j, then

\[
P^*_{d,j} = \Phi\!\left(\frac{\beta_{j0} - \sqrt{\lambda}\,\sigma}{\sigma\big/\big[\sqrt{n}\,\sqrt{\{\mathrm{corr}(X)^{-1}\}_{j+1,j+1}}\big]}\right) + \Phi\!\left(\frac{-\beta_{j0} - \sqrt{\lambda}\,\sigma}{\sigma\big/\big[\sqrt{n}\,\sqrt{\{\mathrm{corr}(X)^{-1}\}_{j+1,j+1}}\big]}\right).
\]

Thus, P^*_{d,j} also depends on the correlations between the covariates. Given all other quantities in P^*_{d,j} except β_{j0}, P^*_{d,j} is still a symmetric function of β_{j0} and an increasing function of |β_{j0}|. However, under different correlation structures of X, the shape of P^*_{d,j} can vary with the value of β_{j0}. Therefore, both the value of |β_{j0}| and the correlation structure of X influence the signal strength of X_j, as illustrated in Figure 1.

Figure 1: Plots of P^*_{d,1} as β_{10} varies under three different cases in linear regression models. In case 1, the correlation structure of X is the independence structure; in case 2, the AR(1) structure with ρ = 0.5; in case 3, the exchangeable structure with ρ = 0.5. In all cases, n = 100, p = 5, λ = 0.2, σ = 1, and β_{10} varies between −1 and 1, with a step size of 0.05.

Case Two: Logistic regression model

Under the logistic regression model setting,

\[
E(y_i \mid x_i) = p_i = \frac{\exp(\alpha_0 + x_i^\top \beta_0)}{1 + \exp(\alpha_0 + x_i^\top \beta_0)}.
\]

We obtain that in (3.1), D_{0,ii} = p_i(1 − p_i) and D_0 = diag{p_1(1 − p_1), ..., p_n(1 − p_n)}. Thus, P^*_{d,j} not only depends on β_{j0}, but also on γ_0^{-j}, the coefficients of the other covariates. This is a fundamental difference between logistic regression models and linear regression models in terms of selection probability. In contrast to linear regression models, x_i influences P^*_{d,j} through the matrix E[X̃^⊤ diag{p_1(1 − p_1), ..., p_n(1 − p_n)} X̃], rather than through the correlation matrix of X, in logistic regression models. In addition, in the Supplementary Material S2.1, we show that P^*_{d,j} is not necessarily a symmetric function of β_{j0}, given the other quantities in P^*_{d,j}. Thus, |β_{j0}| cannot be used in place of P^*_{d,j} to measure the signal strength of X_j, which differs from Shi and Qu (2017).

In addition, for the logistic regression model, the range of γ_0 is bounded so that p_i can satisfy the condition 0 < c_1 < p_i < c_2 < 1, where c_1 and c_2 are some positive constants. We show that, given the other quantities in P^*_{d,j} except β_{j0}, P^*_{d,j} is an increasing function of β_{j0} if 0 < β_{j0} < c_3, and a decreasing function of β_{j0} if −c_4 < β_{j0} < 0, where c_3 and c_4 are some bounded positive constants depending on c_1 and c_2. Proofs of the above findings are provided in the Supplementary Material S2.2. We also illustrate these properties in Figure 2. Note that in this case, the response variable has two categories. However, the analysis can be easily extended to the case where there are more than two categories.

Figure 2: Plots of P^*_{d,1} as β_{10} varies under three different cases in logistic regression models. In case 1, X_1 and X_2 both follow the standard normal distribution, and X_1 and X_2 are independent; in case 2, X_1 and X_2 both follow the centralized exponential distribution with mean zero and variance one, and X_1 and X_2 are independent; in case 3, X_1 and X_2 both follow the standard normal distribution, and X_1 and X_2 have a correlation of 0.5. In all cases, n = 300, γ_0 = (0.3, β_{10}, 0.2)′, λ = 0.05, and β_{10} varies between −1 and 1, with a step size of 0.05.

Case Three: Poisson regression model

Under the Poisson regression model setting,

\[
P(y_i = y \mid x_i) = \frac{\lambda_i^{y}}{y!}\exp(-\lambda_i),
\]

where λ_i = E(y_i | x_i) = exp(α_0 + x_i^⊤β_0). Then, in (3.1), D_{0,ii} = λ_i and D_0 = diag{λ_1, ..., λ_n}. We obtain conclusions similar to those for logistic regression models, except that P^*_{d,j} is influenced by x_i through the matrix E[X̃^⊤ diag{λ_1, ..., λ_n} X̃]. Note that under Assumption 1, the range of γ_0 is bounded. Given the other quantities in P^*_{d,j} except β_{j0}, P^*_{d,j} is an increasing function of β_{j0} if 0 < β_{j0} < c_5, and a decreasing function of β_{j0} if −c_6 < β_{j0} < 0, where c_5 and c_6 are some bounded positive constants. The proof for this finding is provided in the Supplementary Material S2.2. Figure 3 illustrates P^*_{d,j}.

Figure 3: Plots of P^*_{d,1} as β_{10} varies under three different cases in Poisson regression models. In case 1, X_1 and X_2 both follow the standard normal distribution, and X_1 and X_2 are independent; in case 2, X_1 and X_2 both follow the centralized exponential distribution with mean 0 and variance 1, and X_1 and X_2 are independent; in case 3, X_1 and X_2 both follow the standard normal distribution, and X_1 and X_2 have a correlation of 0.5. In all cases, n = 300, γ_0 = (0.3, β_{10}, 0.2)′, λ = 0.05, and β_{10} varies between −0.95 and 0.95, with a step size of 0.05.

The finite-sample properties of P^*_{d,j} under other likelihood-based models can be analyzed similarly. In general, P^*_{d,j} is an integrated indicator: it shows how the selection probability of X_j is influenced by γ_0, x_i, n, and λ in finite samples. Given the other quantities in P^*_{d,j} except β_{j0}, P^*_{d,j} is not necessarily a symmetric function of β_{j0} or an increasing function of |β_{j0}|. Based on the above analysis, we propose using P^*_{d,j} to measure the signal strength directly, rather than using |β_{j0}|. Intuitively, if P^*_{d,j} is close to one, then the variable X_j is defined to be a strong signal; if P^*_{d,j} is close to zero, then X_j is defined to be a noise variable; if P^*_{d,j} lies between the strong and noise levels, then X_j is defined to be a weak signal. Specifically, we introduce two threshold values, δ_s and δ_w. Then the


three levels of signal strength can be defined as

\[
X_j \in
\begin{cases}
X^{(S)}, & \text{if } P^*_{d,j} > \delta_s;\\
X^{(W)}, & \text{if } \delta_w < P^*_{d,j} \le \delta_s;\\
X^{(N)}, & \text{if } P^*_{d,j} \le \delta_w,
\end{cases}
\tag{3.2}
\]

where 0 < τ_w ≤ δ_w < δ_s ≤ τ_s ≤ 1, τ_w = min_j P^*_{d,j}, and τ_s = max_j P^*_{d,j}. Obviously, it is easier to select a stronger signal in the variable selection process than it is to select a weaker one.

3.2 Weak signal identification

In this section, we show how to identify weak signals. Based on the analysis in Section 3.1, the approximated selection probability P^*_{d,j} depends on the true parameter γ_0 and the distribution of x_i, which are unknown in practice. In the following, we estimate P^*_{d,j} by plugging in the maximum likelihood estimator γ^{(0)} and the empirical means of the random variables in (3.1). That is,

\[
\hat{P}^*_{d,j} = \Phi\!\left( \frac{-\sqrt{\dfrac{n\lambda \sum_{i=1}^{n} D^{(0)}_{ii}}{\sum_{i=1}^{n} D^{(0)}_{ii}x_{ij}^{2}\,\sum_{i=1}^{n} D^{(0)}_{ii} - \big(\sum_{i=1}^{n} D^{(0)}_{ii}x_{ij}\big)^{2}}} + \beta^{(0)}_{j}}{\sqrt{\big\{(\tilde{X}^{\top}D^{(0)}\tilde{X})^{-1}\big\}_{j+1,j+1}}} \right)
+ \Phi\!\left( \frac{-\sqrt{\dfrac{n\lambda \sum_{i=1}^{n} D^{(0)}_{ii}}{\sum_{i=1}^{n} D^{(0)}_{ii}x_{ij}^{2}\,\sum_{i=1}^{n} D^{(0)}_{ii} - \big(\sum_{i=1}^{n} D^{(0)}_{ii}x_{ij}\big)^{2}}} - \beta^{(0)}_{j}}{\sqrt{\big\{(\tilde{X}^{\top}D^{(0)}\tilde{X})^{-1}\big\}_{j+1,j+1}}} \right). \tag{3.3}
\]

In practice, we identify the signal strength level of X_j based on P̂^*_{d,j} and introduce two threshold values δ_1 and δ_2. We denote the identified subsets of strong signals, weak signals, and noise variables as S^{(S)}, S^{(W)}, and S^{(N)}, respectively:

\[
S^{(S)} = \{j : \hat{P}^*_{d,j} > \delta_1\}; \qquad
S^{(W)} = \{j : \delta_2 < \hat{P}^*_{d,j} \le \delta_1\}; \qquad
S^{(N)} = \{j : \hat{P}^*_{d,j} \le \delta_2\}. \tag{3.4}
\]

The selection of δ_1 and δ_2 is crucial to determining the signal type. The threshold value δ_1 is selected to ensure that we identify strong signals only when their selection probabilities are high. Assume α is a significance level; we choose δ_1 to be larger than 1 − α, so that the identified strong signals are indeed strong. The threshold value δ_2 is selected to control the false positive rate of selecting the variable X_j. Denote the false positive rate as τ. Then τ can be defined as

\[
\tau = P\big(j \notin S^{(N)} \mid \beta_{j0} = 0, \gamma_0^{-j}\big) = P\big(\hat{P}^*_{d,j} > \delta_2 \mid \beta_{j0} = 0, \gamma_0^{-j}\big). \tag{3.5}
\]

Thus, δ_2 can be estimated based on (3.5). Because the value of γ_0 is unknown in practice, we estimate it using the one-step adaptive lasso estimator γ^{(1)}. Furthermore, to make the estimated false positive rate equal to τ based on the observed data, we take δ_2 to be the 100(1 − τ)% quantile of {P̂^*_{d,j} : β^{(1)}_j = 0, j = 1, ..., p}. Because we intend to recover weak signals in finite samples, τ is chosen to be larger than zero. However, the value of τ cannot be too large, because there is a trade-off between recovering weak signals and including noise variables. In practice, if we want to recover more weak signals, we can choose a larger τ; if we want to make the false positive rate lower, we can choose a smaller τ. In the simulation studies, we perform a sensitivity analysis for the choice of δ_1 and τ.


4. Weak signal inference

In this section, we propose a two-step inference procedure for constructing confidence intervals for the regression coefficients. The procedure consists of two parts: if a covariate is identified as a strong signal, then its confidence interval is constructed based on the asymptotic theory for the nonzero one-step adaptive lasso estimator (Zou and Li, 2008); if a covariate is identified as a weak signal or a noise variable, then we provide a confidence interval based on the following inference theory for the maximum likelihood estimator.

Similarly to the theory in Zou and Li (2008), we can obtain the asymptotic distribution of the one-step adaptive lasso estimator. Without loss of generality, assume A_n = {1, ..., s}, where s is the number of nonzero elements in β^{(1)}. Define B_n = {k : γ^{(1)}_k ≠ 0, k = 1, ..., p + 1}; then B_n = {1, ..., s + 1}. Although the one-step adaptive lasso estimator β^{(1)}_{A_n} is biased, owing to the shrinkage effect in finite samples, we can construct a de-biased confidence interval for the true coefficient based on the estimated bias and covariance matrix of β^{(1)}_{A_n}, as shown in Theorem 1. The proof of Theorem 1 is given in the Supplementary Material S3.

Theorem 1. Denote X^⊤D^{†(0)}X and X̃^⊤D^{(0)}X̃/n as Z^{(0)} and I^{(0)}, respectively. The estimators of the bias and the covariance matrix of β^{(1)}_{A_n} are given by

\[
\widehat{\mathrm{bias}}\big(\beta^{(1)}_{A_n}\big) = -\left\{\frac{1}{n}Z^{(0)}_{A_n} + \Sigma_{\lambda}\big(\beta^{(0)}_{A_n},\beta^{(1)}_{A_n}\big)\right\}^{-1}
\left(\frac{\lambda}{|\beta^{(0)}_{1}|}\,\mathrm{sgn}\big(\beta^{(1)}_{1}\big),\ \ldots,\ \frac{\lambda}{|\beta^{(0)}_{s}|}\,\mathrm{sgn}\big(\beta^{(1)}_{s}\big)\right)^{\top}
\]

and

\[
\widehat{\mathrm{cov}}\big(\beta^{(1)}_{A_n}\big) = \frac{1}{n^{3}}
\left\{\frac{1}{n}Z^{(0)}_{A_n} + \Sigma_{\lambda}\big(\beta^{(0)}_{A_n},\beta^{(1)}_{A_n}\big)\right\}^{-1}
Z^{(0)}_{A_n}\,\big[(I^{(0)}_{B_n})^{-1}\big]_{A_n}\,Z^{(0)}_{A_n}
\left\{\frac{1}{n}Z^{(0)}_{A_n} + \Sigma_{\lambda}\big(\beta^{(0)}_{A_n},\beta^{(1)}_{A_n}\big)\right\}^{-1},
\]

respectively, where Σ_λ(β^{(0)}_{A_n}, β^{(1)}_{A_n}) = diag{λ/(|β^{(0)}_1||β^{(1)}_1|), ..., λ/(|β^{(0)}_s||β^{(1)}_s|)}, Z^{(0)}_{A_n} is the submatrix of Z^{(0)} corresponding to β^{(0)}_{A_n}, and I^{(0)}_{B_n} is the submatrix of I^{(0)} corresponding to γ^{(0)}_{B_n}.

Based on Theorem 1, if the covariate X_j is identified as a strong signal, then the 100(1 − α)% confidence interval for β_{j0} can be constructed as

\[
\big(\beta^{(1)}_j - b_j - z_{\alpha/2}\sigma_j,\ \beta^{(1)}_j - b_j + z_{\alpha/2}\sigma_j\big), \tag{4.1}
\]

where b_j is the corresponding component of \widehat{\mathrm{bias}}(β^{(1)}_{A_n}) and σ_j is the square root of the corresponding diagonal component of \widehat{\mathrm{cov}}(β^{(1)}_{A_n}).

If the covariate X_j is identified as a weak signal or a noise variable, then the 100(1 − α)% confidence interval for β_{j0} can be constructed as

\[
\big(\beta^{(0)}_j - z_{\alpha/2}\sigma^{(0)}_j,\ \beta^{(0)}_j + z_{\alpha/2}\sigma^{(0)}_j\big), \tag{4.2}
\]

where σ^{(0)}_j is the square root of the corresponding diagonal component of \widehat{\mathrm{cov}}(γ^{(0)}) = (X̃^⊤D^{(0)}X̃)^{-1}.

Remark 1. Note that Shi and Qu (2017) did not construct confidence intervals for the noise variables, whereas we do. As shown in Figure 6 in the simulation studies, this improves the coverage probabilities for the noise variables and weak signals. Using the two-step inference method based on Shi and Qu (2017), the coverage probabilities for the noise variables tend to be lower than 1 − α, and the coverage probabilities for weak signals tend to be higher than 1 − α. This is because, under that method, confidence intervals for the noise variables are constructed only when they are misidentified as weak or strong signals, in which case the estimated coefficients tend to be far from the true values, leading to lower coverage probabilities; likewise, no confidence intervals are constructed for weak signals when they are misidentified as noise variables, making the coverage probabilities of the remaining intervals higher. To solve these problems, we propose constructing confidence intervals for the identified noise variables as well. As a result, the coverage probabilities of the confidence intervals become closer to 1 − α.

In summary, our proposed confidence interval for β_{j0} can be written as

\[
\big(\beta^{(1)}_j - b_j - z_{\alpha/2}\sigma_j,\ \beta^{(1)}_j - b_j + z_{\alpha/2}\sigma_j\big)\, I\{j \in S^{(S)}\}
+ \big(\beta^{(0)}_j - z_{\alpha/2}\sigma^{(0)}_j,\ \beta^{(0)}_j + z_{\alpha/2}\sigma^{(0)}_j\big)\, I\{j \in S^{(W)} \cup S^{(N)}\}, \tag{4.3}
\]

which combines both (4.1) and (4.2).
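A compact R sketch of the combined interval (4.3) is given below; it assumes the quantities from Theorem 1 and Section 3 have already been computed (beta1 and beta0 are the one-step and maximum likelihood estimates, bias_j and sd_j the de-biased bias and standard error components from Theorem 1, sd0_j the MLE standard errors, and type the labels from the identification step), and the function name two_step_ci is ours.

```r
# Two-step 100(1 - alpha)% confidence interval (4.3) for coefficient j.
two_step_ci <- function(j, type, beta1, bias_j, sd_j, beta0, sd0_j, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  if (type[j] == "strong") {
    centre <- beta1[j] - bias_j[j]        # de-biased one-step estimate, (4.1)
    c(centre - z * sd_j[j], centre + z * sd_j[j])
  } else {
    c(beta0[j] - z * sd0_j[j], beta0[j] + z * sd0_j[j])   # MLE-based, (4.2)
  }
}
```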

5. Simulation studies

In this section, we conduct simulation studies to evaluate the finite-sample performance of the proposed signal identification criterion and two-step inference procedure. Consider the following logistic regression model:

\[
P(y_i = 1 \mid x_i) = \frac{\exp(\alpha_0 + x_i^\top \beta_0)}{1 + \exp(\alpha_0 + x_i^\top \beta_0)}, \quad i = 1, \ldots, n.
\]

We generate the covariate vector x_i = (x_{i1}, ..., x_{ip})^⊤ from a multivariate normal distribution with mean zero and covariance matrix R(ρ)σ², where R(ρ) is a correlation matrix with the AR(1) correlation structure and σ² = 1. All the generated covariates are standardized by subtracting their sample means and dividing by their sample standard deviations. For each setting, we choose n = 350 or 550, p = 25 or 35, ρ = 0, 0.2, or 0.5, and α_0 = 0.5. The regression coefficient vector β_0 is set to (1, 1, 0.5, θ, 0, ..., 0)^⊤, which consists of two large coefficients equal to 1, one moderate-size coefficient 0.5, one varying coefficient θ, and p − 4 zero coefficients. The coefficient θ ranges from zero to one, with a step size of 0.05. In each simulation setting, we repeat the simulations 500 times. The implementation details of the one-step adaptive lasso estimators are given in the Supplementary Material S4.
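For concreteness, one simulated data set under the stated design can be generated with a few lines of R (a sketch of the described setup, not the authors' code; mvrnorm from the MASS package is used for the multivariate normal draw, and the function name sim_logistic is ours):

```r
library(MASS)  # for mvrnorm

# Generate one simulated data set from the logistic model of Section 5.
sim_logistic <- function(n = 350, p = 25, rho = 0, theta = 0.5, alpha0 = 0.5) {
  R <- rho^abs(outer(1:p, 1:p, "-"))          # AR(1) correlation matrix
  X <- mvrnorm(n, mu = rep(0, p), Sigma = R)
  X <- scale(X)                               # standardize the covariates
  beta0 <- c(1, 1, 0.5, theta, rep(0, p - 4))
  prob <- plogis(alpha0 + X %*% beta0)
  y <- rbinom(n, size = 1, prob = prob)
  list(X = X, y = y, beta0 = beta0)
}
dat <- sim_logistic()
```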

Figure 4 displays the results for different types of selection probability for X4 when ρ = 0.

In Figure 4, the approximated selection probability based on (3.1) is close to the empirical

selection probability, indicating a small approximation error from the approximated selection

probability. In addition, both the empirical selection probability and the approximated

selection probability increase with θ, implying that a larger value of θ leads to a stronger

signal strength. This observation supports the result in Section 3.1. Although the median

of the estimated selection probabilities is not too close to the empirical selection probability

when θ is small, the estimated selection probability still increases with the signal strength.

We can still use the estimated selection probability to identify the signal strength level.

The simulation results for the correlated covariates are provided in Figures S1 and S2 of


the Supplementary Material S5, and the approximated selection probability is similar to the

empirical selection probability. In addition, the empirical selection probability, approximated

selection probability, and estimated selection probability, in general, increase with the value

of θ. Thus, we can also identify the signal strength level based on the value of θ.

We then identify whether a covariate is a strong signal, weak signal, or noise variable

based on the criterion in (3.4). For illustration, we choose δ1 to be 0.99 and τ to be 0.1.

Figure 5 presents the empirical probabilities of assigning the covariate X4 to different

signal categories as θ varies and ρ = 0. Figure 5 shows that when θ is close to zero, X4 is

more likely to be identified as a noise variable; when θ is far away from zero and one, the

empirical probability of X4 being identified as a weak signal is highest; as θ becomes larger,

the empirical probability of X4 being identified as a strong signal becomes more dominant,

and gradually increases to one. The results for the correlated covariates are given in Figures

S3 and S4 of the Supplementary Material S5, and we have similar findings. Therefore, our

proposed signal identification criterion (3.4) performs well in practice.

After identifying the signal strength levels, we construct the 95% confidence intervals

based on the proposed two-step inference procedure. We also compare our method with

the two-step inference method based on Shi and Qu (2017), which does not construct confi-

dence intervals for the identified noise variables. In addition, we construct confidence inter-

vals based on the asymptotic theory for the one-step adaptive lasso estimator, as shown

in (4.1), the maximum likelihood estimation method, as shown in (4.2), the perturba-

tion method (Minnier et al., 2011), the estimating equation-based method (Neykov et al.,

2018), the standard bootstrap method (Efron and Tibshirani, 1994), the smoothed boot-

strap method (Efron, 2014), the de-biased lasso method (Javanmard and Montanari, 2014;

Van de Geer et al., 2014; Zhang and Zhang, 2014), and two different types of bootstrap de-biased lasso methods (Dezeure et al., 2017). The number of bootstrap resamples is set to 4000 for all bootstrap methods, and the resampling number is set to 500 for the perturbation method. The implementation details of the estimating equation-based method and the two types of bootstrap de-biased lasso methods can be found in the Supplementary Material S4.

Figure 4: Different types of selection probability for X4 when ρ = 0 (panels: n = 350 or 550 and p = 25 or 35). Pdem: empirical selection probability, which is equal to the empirical probability of θ^{(1)} ≠ 0 based on 500 Monte Carlo samples; Pdapproxi: approximated selection probability based on (3.1), where the expectations in (3.1) are calculated using the function cubintegrate in R; Pdest: median of the estimated selection probabilities based on (3.3) for the 500 Monte Carlo samples.

Figure 5: Empirical probabilities of assigning the covariate X4 to different signal categories when ρ = 0 (panels: n = 350 or 550 and p = 25 or 35).

For the method based on the asymptotic theory for the one-step adaptive lasso estimator, if

a variable is not selected, then we do not construct a confidence interval for it, because the


asymptotic normality is established only for the selected variables.

Figures 6 and 7 provide coverage probabilities of the 95% confidence intervals as θ varies

and (n, p, ρ) = (350, 25, 0). In Figures 6 and 7, the vertical line on the left shows whether

X4 is more likely to be identified as a noise variable or a weak signal, and the vertical line on

the right distinguishes whether X4 is more likely to be identified as a weak signal or a strong

signal. The threshold values are obtained from Figure 5. Comparing the proposed two-step

inference method with the two-step inference method based on Shi and Qu (2017), when θ

is small, the former outperforms the latter. When θ is close to zero, the coverage probability of the asymptotic method is too low and close to zero, while the perturbation method, standard bootstrap method, smoothed bootstrap method, and type-I bootstrap de-biased lasso method provide over-coverage confidence intervals, with coverage probabilities approaching one. When the signal is weak, the asymptotic method, perturbation method,

standard bootstrap method, smoothed bootstrap method, and type-I bootstrap de-biased

lasso method all perform poorly, and their coverage probabilities are much lower than 95%.

In addition, the coverage probability of the estimating equation-based method is slightly

lower than 95%. When the signal is stronger, the performance of the maximum likelihood

estimation method, estimating equation-based method, de-biased lasso method, and type-I

bootstrap de-biased lasso method also becomes worse. However, the coverage probabilities of

the 95% confidence intervals for the proposed method and the type-II bootstrap de-biased

lasso method are close to 95% under all signal strength levels of θ.

Figure 6: Coverage probabilities of the 95% confidence intervals when (n, p, ρ) = (350, 25, 0). Proposed: the proposed two-step inference method; OldTwostep: the two-step inference method based on Shi and Qu (2017), which does not construct confidence intervals for identified noise variables; Asym: the method based on the asymptotic theory using the one-step adaptive lasso estimator; MLE: the maximum likelihood estimation method; Perturb: the perturbation method; EstEq: the estimating equation-based method.

Figure 7: Coverage probabilities of the 95% confidence intervals when (n, p, ρ) = (350, 25, 0). Proposed: the proposed two-step inference method; SdBS: the standard bootstrap method; SmBS: the smoothed bootstrap method; DeLasso: the de-biased lasso method; BSDe1: the type-I bootstrap de-biased lasso method; BSDe2: the type-II bootstrap de-biased lasso method.

Figure 8 provides the average widths of the 95% confidence intervals as θ varies and (n, p, ρ) = (350, 25, 0). Note that the widths of the confidence intervals for the two types of two-step inference methods are both very close, while their coverage probabilities are not similar when θ is small. The width of the confidence interval using the proposed method is between those of the maximum likelihood estimation method and the asymptotic method. This is not surprising, because the proposed method combines the strengths of these two methods. Although the confidence intervals based on the asymptotic method, perturbation method, standard bootstrap method, and smoothed bootstrap method are narrow when θ is close to zero, the coverage probabilities are not accurate, because they are either too small or too large. When the signal is strong, the widths of the confidence intervals for the perturbation method, standard bootstrap method, and smoothed bootstrap method are, in general, larger than that for the proposed method. Although the estimating equation-based method, de-biased lasso method, and type-I bootstrap de-biased lasso method have shorter confidence intervals than the proposed method, their coverage probabilities decrease as the signal becomes stronger. Overall, the confidence interval for the type-II bootstrap de-biased lasso method is wider than that of the proposed method.

The coverage probabilities and average widths of the 95% confidence intervals under all

simulation settings are summarized in Tables S1–S4 of the Supplementary Material S5. For

each simulation setting, we select three different values of θ, under which X4 is identified as a

noise variable, weak signal, and strong signal, respectively. In summary, the findings from the

simulation setting of (n, p, ρ) = (350, 25, 0) still hold under other simulation settings when

ρ = 0. By comparison, the average widths of the confidence intervals for all methods decrease

with the sample size and increase with the correlations between the covariates. When X4 is not a strong signal, regardless of the correlations among covariates, the confidence intervals for the asymptotic method have relatively low coverage probabilities. When X4 is a strong signal, if ρ is 0 or 0.2, the asymptotic method provides accurate confidence intervals, but if ρ increases to 0.5, the performance of the asymptotic method deteriorates. However, the coverage probabilities of the confidence intervals for the proposed method are still close to 95% under all simulation settings.

Figure 8: Average widths of the 95% confidence intervals when (n, p, ρ) = (350, 25, 0). Proposed: the proposed two-step inference method; OldTwostep: the two-step inference method based on Shi and Qu (2017), which does not construct confidence intervals for identified noise variables; Asym: the method based on the asymptotic theory using the one-step adaptive lasso estimator; MLE: the maximum likelihood estimation method; Perturb: the perturbation method; EstEq: the estimating equation-based method; SdBS: the standard bootstrap method; SmBS: the smoothed bootstrap method; DeLasso: the de-biased lasso method; BSDe1: the type-I bootstrap de-biased lasso method; BSDe2: the type-II bootstrap de-biased lasso method.

In order to see whether the performance of the proposed method is sensitive to the choice


of the threshold values δ1 and τ , we also consider other combinations of threshold values. For

example, when (n, p, ρ) = (350, 25, 0), we set τ as 0.1 and choose δ1 to be 0.96, 0.97, 0.98, or

0.99, which is larger than 1−α = 0.95. The empirical probabilities of assigning the covariate

X4 to different signal categories are shown in Figure S5 of the Supplementary Material S5.

As the value of δ1 becomes larger and the value of θ is fixed, the empirical probability of

identifying X4 as a weak signal becomes larger, and that of identifying X4 as a strong

signal becomes smaller if θ is not sufficiently large. Furthermore, the empirical probability

of identifying X4 as a noise variable does not change. This is because of the proposed

signal identification criterion. Figures S6–S7 in the Supplementary Material S5 show the

corresponding coverage probabilities and average widths of the 95% confidence intervals for

the proposed two-step inference method. As shown, the coverage probability becomes larger

as δ1 increases and θ is between 0.6 and 0.75, and the average width becomes larger as δ1

increases and θ is between 0.15 and 0.75. This is not surprising because when δ1 increases,

the probability of using the maximum likelihood method to construct the confidence intervals

becomes larger. As shown in Figures 7 and 8, when θ is not too large, the coverage probability

and average width of the confidence interval based on the maximum likelihood method are higher than those based on the asymptotic method. However, as δ1 varies, the changes in the

coverage probability and average width are not large.

We also consider another situation where δ1 is set to 0.99 and τ is chosen to be 0.05, 0.1, 0.15,

or 0.2. Figure S8 in the Supplementary Material S5 shows the empirical probabilities of assigning the covariate X4 to different signal categories in this situation. Here, we find that as

τ increases, the empirical probability of identifying X4 as a weak signal is larger, and that of

identifying X4 as a noise variable is smaller if θ is not too large. The empirical probability


of identifying X4 as a strong signal remains the same. This is consistent with the proposed

signal selection criterion. However, because the proposed two-step inference method uses

the same confidence interval construction method for the identified noise variables and weak

signals, the confidence interval does not change with the value of τ , as shown in Figures

S9–S10 of the Supplementary Material S5.

We also examine whether the performance of the proposed method is sensitive to the

total number of weak signals. We reset the regression coefficient vector β_0 to (1, 1, 0.5, θ, 0.3, ..., 0.3, 0, ..., 0)^⊤, where the coefficient 0.3 appears q times and zero appears p − q − 4 times, with q taken to be 0, 1, 2, or 3. For illustration, let

(n, p, ρ) = (350, 25, 0), δ1 be 0.99, and τ be 0.1. Based on the signal identification criterion, all the q covariates corresponding to the coefficient 0.3 are weak signals if θ ranges

from zero to one. If the covariate X4 is identified as a weak signal, then the total number of

weak signals is q+1; otherwise it is q. The empirical probabilities of assigning the covariate

X4 to different signal categories are shown in Figure S11 of the Supplementary Material S5,

which are not sensitive to the value of q. Figures S12–S13 in the Supplementary Material

S5 respectively show the coverage probabilities and average widths of the 95% confidence

intervals for the proposed two-step inference method, showing that when θ is small, the

average width increases with the value of q, while the coverage probability does not change

monotonically with the value of q. In addition, as q varies, the variations in average width

and coverage probability are not large. Thus, the performance of the proposed method is

quite robust to the total number of weak signals.


6. Real-data application

To illustrate the performance of the proposed method, we apply it to a data set in the Practice

Fusion diabetes study, which was provided by Kaggle as part of the “Practice Fusion Diabetes

Classification” challenge (Kaggle, 2012). The data set consists of de-identified electronic

medical records for over 10,000 patients. The training data contain a total of 9948 patients, together with a binary variable indicating whether a patient is diagnosed with Type 2

diabetes mellitus (T2DM), or not. In this analysis, we aim to determine the most important

risk factors for the incidence of T2DM, which can be used to identify patients with a high

risk of T2DM.

We first extract 119 predictors from the predictors selected by the first-place winner in the

Kaggle competition by removing some highly correlated predictors (details can be found in

https://www.kaggle.com/c/pf2012-diabetes/overview/winners). These predictors can

be divided into six categories: basic information, transcript records, diagnosis information,

medication information, lab results, and smoking status. Detailed information about these

predictors can be found in Table S5 in the Supplementary Material S6. One outlying patient

is also removed owing to inaccurate information on the predictors. All the predictors are

standardized beforehand. We adopt the following logistic regression model to fit the data

set:

\[
P(y_i = 1 \mid x_i) = \frac{\exp\!\big(\alpha + \sum_{j=1}^{p} x_{ij}\beta_j\big)}{1 + \exp\!\big(\alpha + \sum_{j=1}^{p} x_{ij}\beta_j\big)}, \quad i = 1, \ldots, n,
\]

where p = 119 and n = 9947.

We first obtain the one-step adaptive lasso estimates of the regression coefficients following the tuning parameter selection procedure given in the Supplementary Material S4. We


then identify whether a predictor is a strong signal, weak signal, or noise variable based on

criterion (3.4). Here, we choose δ1 to be 0.99 and τ to be 0.1. From all the predictors, we

identify 18 strong signals, 32 weak signals, and 69 noise variables. The 18 strong signals are

all selected by the one-step adaptive lasso estimator, indicating consistency between it and

our method for strong signal selection. Among the 32 weak signals, 24 are also selected by

the one-step adaptive lasso estimator, while the other eight predictors are only identified by

our method. These eight additional predictors include the numbers of times being diagnosed

with herpes zoster, hypercholesterolemia, hypertensive heart disease, respiratory infection,

sleep apnea, and joint pain, and the number of transcripts for cardiovascular disease and the

number of diagnoses per weighted year. The relationships between these eight predictors

and diabetes have also been studied by other researchers. For example, Papagianni et al.

(2018) reviewed studies on associations between herpes zoster and diabetes mellitus, and

found that herpes zoster and T2DM were likely to coexist for the same patient.

Next, we construct the 95% confidence intervals using our two-step inference method,

together with all other comparison methods in Section 5. Figure 9 shows the average widths

of the confidence intervals for the strong and weak signals. For both, the widths of the

confidence intervals for the two types of two-step inference methods are the same. For

strong signals, the proposed method and the asymptotic method provide the shortest con-

fidence intervals. For weak signals, the widths of the confidence intervals based on the pro-

posed method are smaller than those based on the perturbation method, standard bootstrap

method, smoothed bootstrap method, and two types of bootstrap de-biased lasso methods.


Figure 9: Average widths of the 95% confidence intervals for the diabetes data set. Note that because the asymptotic method does not construct confidence intervals for all the weak signals, its result for the weak signals is the average width of the confidence intervals for the weak signals that are also selected by the asymptotic method. For the meanings of the notation, see Figures 6 and 7.

7. Conclusion

We have proposed a new unified approach for weak signal identification and inference in

penalized likelihood models, including the special case when the responses are categorical. To

identify weak signals, we propose using the estimated selection probability of each covariate

as a measure of the signal strength, and develop a signal identification criterion based directly

on the estimated selection probability. To construct confidence intervals for the regression

coefficients, we propose a two-step inference procedure. Extensive simulation studies and


a real-data application show that the proposed signal identification method and two-step

inference procedure outperform several existing methods in finite samples.

The proposed method can be extended to a high-dimensional setting where p is not fixed.

One possible way is to use the de-biased lasso estimator as an initial estimator for the one-step

adaptive lasso estimator, and then leverage the asymptotic properties of the de-biased lasso

estimator to derive the selection probability. We can also use a penalized method to estimate

the inverse of the information matrix, such as the CLIME estimator (Cai et al., 2011). In

addition, our signal identification and inference framework can be extended to longitudinal

data. For longitudinal data, we can replace the negative log-likelihood function with the

generalized estimating function in the estimation. Finally, in the fields of causal inference

and econometrics, there is a well-known "weak instrument" problem (Chao and Swanson, 2005; Burgess and Thompson, 2011; Choi et al., 2018), which can be regarded as a weak signal problem. It is worth investigating further using our approach.

Supplementary Material

The online Supplementary Material contains six sections. Section S1 derives the approximated selection probability. Section S2 provides an additional detailed analysis of the approximated selection probability in finite samples. Section S3 contains a proof of Theorem 1. Section S4 presents the implementation details of several methods. Sections S5 and S6 provide additional simulation results and additional information related to the real-data application, respectively.

Acknowledgments

This work was partially supported by the National Science Foundation of the United


States (DMS-1821198, DMS-1952406), National Natural Science Foundation of China (11671096,

11731011, 12071087), and Natural Sciences and Engineering Research Council of Canada

(RGPIN-2019-07052, DGECR-2019-00453, RGPAS-2019-00093).

References

Burgess, S. and Thompson, S. G. (2011). Bias in causal estimates from Mendelian randomization studies with weak instruments. Statistics in Medicine, 30(11):1312–1323.

Cai, T., Liu, W., and Luo, X. (2011). A constrained L1 minimization approach to sparse precision matrix estimation.

Journal of the American Statistical Association, 106(494):594–607.

Chao, J. C. and Swanson, N. R. (2005). Consistent estimation with a large number of weak instruments. Econometrica,

73(5):1673–1692.

Choi, J., Gu, J., and Shen, S. (2018). Weak-instrument robust inference for two-sample instrumental variables

regression. Journal of Applied Econometrics, 33(1):109–125.

Dezeure, R., Buhlmann, P., and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap.

Test, 26(4):685–719.

Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association,

109(507):991–1007.

Efron, B. and Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

Huang, J. and Xie, H. (2007). Asymptotic oracle properties of SCAD-penalized least squares estimators. In Asymptotics: Particles, Processes and Inverse Problems, pages 149–166. Institute of Mathematical Statistics.


Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression.

The Journal of Machine Learning Research, 15(1):2869–2909.

Jia, J., Xie, F., and Xu, L. (2019). Sparse Poisson regression with penalized weighted score function. Electronic Journal of Statistics, 13(2):2898–2920.

Jin, J., Zhang, C.-H., and Zhang, Q. (2014). Optimality of graphlet screening in high dimensional variable selection.

The Journal of Machine Learning Research, 15(1):2723–2772.

Kaggle (2012). Practice Fusion diabetes classification. Data retrieved from Kaggle competition dataset, http://www.kaggle.com/c/pf2012-diabetes.

Lambert, P. and Eilers, P. H. (2005). Bayesian proportional hazards model with time-varying regression coefficients: A penalized Poisson regression approach. Statistics in Medicine, 24(24):3977–3989.

Li, Y., Hong, H. G., Ahmed, S. E., and Li, Y. (2019). Weak signals in high-dimensional regression: Detection,

estimation and prediction. Applied Stochastic Models in Business and Industry, 35(2):283–298.

Liu, H., Xu, X., and Li, J. J. (2020). A bootstrap lasso+ partial ridge method to construct confidence intervals for

parameters in high-dimensional sparse linear models. arXiv preprint arXiv:1706.02150.

Minnier, J., Tian, L., and Cai, T. (2011). A perturbation method for inference on regularized regression estimates.

Journal of the American Statistical Association, 106(496):1371–1382.

Neykov, M., Ning, Y., Liu, J. S., Liu, H., et al. (2018). A unified theory of confidence regions and testing for

high-dimensional estimating equations. Statistical Science, 33(3):427–443.

Papagianni, M., Metallidis, S., and Tziomalos, K. (2018). Herpes zoster and diabetes mellitus: a review. Diabetes

Therapy, 9(2):545–550.


Park, M. Y. and Hastie, T. (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9(1):30–

50.

Reangsephet, O., Lisawadi, S., and Ahmed, S. E. (2020). Weak signals in high-dimensional logistic regression models.

In International Conference on Management Science and Engineering Management, pages 121–133. Springer.

Shi, P. and Qu, A. (2017). Weak signal identification and inference in penalized model selection. The Annals of

Statistics, 45(3):1214–1253.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series

B (Statistical Methodology), 58(1):267–288.

Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical

Society: Series B (Statistical Methodology), 73(3):273–282.

Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso.

Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108.

Van de Geer, S., Buhlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions

and tests for high-dimensional models. Annals of Statistics, 42(3):1166–1202.

Van de Geer, S., Buhlmann, P., and Zhou, S. (2011). The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics, 5:688–749.

Wang, H. and Leng, C. (2007). Unified lasso estimation by least squares approximation. Journal of the American

Statistical Association, 102(479):1039–1048.

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso

penalized logistic regression. Bioinformatics, 25(6):714–721.


Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the

Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics,

38(2):894–942.

Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear

models. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 217–242.

Zhang, H. and Jia, J. (2017). Elastic-net regularized high-dimensional negative binomial regression: consistency and

weak signals detection. arXiv preprint arXiv:1712.03412.

Zhang, T. (2013). Multi-stage convex relaxation for feature selection. Bernoulli, 19(5B):2277–2293.

Zhang, Y. (2017). Recovery of weak signal in high dimensional linear regression by data perturbation. Electronic

Journal of Statistics, 11(2):3226–3250.

Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–

2563.

Zhu, J. and Hastie, T. (2004). Classification of gene microarrays by penalized logistic regression. Biostatistics,

5(3):427–443.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association,

101(476):1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical

Society: Series B (Statistical Methodology), 67(2):301–320.


Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics,

36(4):1509–1533.

Department of Computer and Mathematical Sciences, University of Toronto, ON M1C 1A4, Canada.

E-mail: [email protected]

Facebook, Menlo Park, CA 94025, United States.

E-mail: [email protected]

Department of Statistics, Fudan University, Shanghai 200433, China.

E-mail: [email protected]

Department of Statistical Sciences, University of Toronto, ON M5S 3G3, Canada.

E-mail: [email protected]

Department of Statistics, University of California, Irvine, CA 92697, United States.

E-mail: [email protected]


SUPPLEMENTARY MATERIALS FOR

“WEAK SIGNAL IDENTIFICATION AND INFERENCE

IN PENALIZED LIKELIHOOD MODELS

FOR CATEGORICAL RESPONSES”

1Yuexia Zhang, 2Peibei Shi, 3Zhongyi Zhu, 4Linbo Wang and 5Annie Qu

1,4University of Toronto, 2Facebook, 3Fudan University, 5University of California, Irvine

Abstract:

The online Supplementary Material contains six sections. Section S1 derives the approximated selection probability. Section S2 provides an additional detailed analysis of the approximated selection probability in finite samples. Section S3 contains a proof of Theorem 1. Section S4 presents the implementation details of several methods. Sections S5 and S6 provide additional simulation results and additional information related to the real-data application, respectively.


S1. Derivation of the approximated selection probability

In Section 2 of the main paper, we obtained the following condition for selecting the covariate $X_j$, $j \in \{1, \ldots, p\}$:

$$\left|\sum_{i=1}^{n}\Big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\Big)^{2}\big(\beta_{j}^{(0)}\big)^{2}+\sum_{k\neq j}\sum_{i=1}^{n}\Big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\Big)\Big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\Big)\beta_{j}^{(0)}\big(\beta_{k}^{(0)}-\beta_{k}^{(1)}\big)\right|>n\lambda.$$

It is equivalent to

$$\begin{aligned}
&\Bigg|\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)^{2}\big(\beta_{j}^{(0)}\big)^{2}}{n}
+\frac{\sum_{k\neq j}\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(0)}-\beta_{k0}+\beta_{k0}\big)}{n}\\
&\quad-\frac{\sum_{k\neq j}\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(1)}-\beta_{k0}+\beta_{k0}\big)}{n}\Bigg|\\
=&\Bigg|\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)^{2}\big(\beta_{j}^{(0)}\big)^{2}}{n}
+\frac{\sum_{k\neq j}\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(0)}-\beta_{k0}\big)}{n}\\
&\quad-\frac{\sum_{k\neq j}\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(1)}-\beta_{k0}\big)}{n}\Bigg|\\
>&\,\lambda. \qquad (S1.1)
\end{aligned}$$

We consider the following three terms in turn:
$$\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)^{2}\big(\beta_{j}^{(0)}\big)^{2}}{n},$$
$$\frac{\sum_{k\neq j}\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(0)}-\beta_{k0}\big)}{n}, \qquad (S1.2)$$
and
$$\frac{\sum_{k\neq j}\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(1)}-\beta_{k0}\big)}{n}. \qquad (S1.3)$$

Since $d_{is}^{(0)}$ is the $(i,s)$th element of $D^{\star(0)}$, where $D^{\star(0)}=(D^{(0)})^{1/2}-(D^{(0)})^{1/2}\mathbf{1}(\mathbf{1}^{\top}D^{(0)}\mathbf{1})^{-1}\mathbf{1}^{\top}D^{(0)}$ and $D^{(0)}$ is an $n\times n$ diagonal matrix with $(i,i)$th element $D_{ii}^{(0)}$, by calculation,
$$\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)^{2}}{n}=\frac{\sum_{i=1}^{n}D_{ii}^{(0)}x_{ij}^{2}}{n}-\frac{\Big(\sum_{i=1}^{n}D_{ii}^{(0)}x_{ij}/n\Big)^{2}}{\sum_{i=1}^{n}D_{ii}^{(0)}/n}.$$

Since $(\boldsymbol{x}_{1},y_{1}),\ldots,(\boldsymbol{x}_{n},y_{n})$ are independent and identically distributed random vectors, $D_{ii}(\boldsymbol{\gamma})$ is a continuous function of $\boldsymbol{\gamma}$, and the maximum likelihood estimator $\boldsymbol{\gamma}^{(0)}\xrightarrow{P}\boldsymbol{\gamma}_{0}$ under some regularity conditions, by the Law of Large Numbers and the Continuous Mapping Theorem we have $\sum_{i=1}^{n}D_{ii}^{(0)}x_{ij}^{2}/n\xrightarrow{P}E(D_{0,ii}x_{ij}^{2})$, $\sum_{i=1}^{n}D_{ii}^{(0)}x_{ij}/n\xrightarrow{P}E(D_{0,ii}x_{ij})$ and $\sum_{i=1}^{n}D_{ii}^{(0)}/n\xrightarrow{P}E(D_{0,ii})$. Then
$$\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)^{2}\big(\beta_{j}^{(0)}\big)^{2}}{n}-\Big[E(D_{0,ii}x_{ij}^{2})-\frac{\{E(D_{0,ii}x_{ij})\}^{2}}{E(D_{0,ii})}\Big]\big(\beta_{j}^{(0)}\big)^{2}\xrightarrow{P}0.$$


By calculation, (S1.2) equals

$$\begin{aligned}
&\sum_{k\neq j}\Bigg\{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}x_{ij}}{n}-\frac{\sum_{i=1}^{n}\sum_{s=1}^{n}x_{ik}D_{ii}^{(0)}D_{ss}^{(0)}x_{sj}}{n\sum_{i=1}^{n}D_{ii}^{(0)}}\Bigg\}\beta_{j}^{(0)}\big(\beta_{k}^{(0)}-\beta_{k0}\big)\\
=&\sum_{k\neq j}\Bigg\{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}x_{ij}}{n}-\frac{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}}{n}\cdot\frac{\sum_{s=1}^{n}D_{ss}^{(0)}x_{sj}}{n}}{\frac{\sum_{i=1}^{n}D_{ii}^{(0)}}{n}}\Bigg\}\beta_{j}^{(0)}\,\frac{\sqrt{n}\big(\beta_{k}^{(0)}-\beta_{k0}\big)}{\sqrt{n}}.
\end{aligned}$$

For the same reason as before, $\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}x_{ij}/n\xrightarrow{P}E(x_{ik}D_{0,ii}x_{ij})$, $\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}/n\xrightarrow{P}E(x_{ik}D_{0,ii})$, $\sum_{s=1}^{n}D_{ss}^{(0)}x_{sj}/n\xrightarrow{P}E(D_{0,ss}x_{sj})$ and $\sum_{i=1}^{n}D_{ii}^{(0)}/n\xrightarrow{P}E(D_{0,ii})$. By the Central Limit Theorem, $\sqrt{n}\big(\beta_{k}^{(0)}-\beta_{k0}\big)\xrightarrow{D}N\big(0,\{I^{-1}(\boldsymbol{\gamma}_{0})\}_{k+1,k+1}\big)$, where $I(\boldsymbol{\gamma}_{0})=E(X^{\top}D_{0}X)/n$. Then $\sqrt{n}\big(\beta_{k}^{(0)}-\beta_{k0}\big)=O_{p}(1)$. Furthermore, since $\beta_{j}^{(0)}\xrightarrow{P}\beta_{j0}$ and the number of covariates $p$ is finite, by Slutsky's Theorem, (S1.2) is $O_{p}(1/\sqrt{n})$.

Based on the oracle properties of $\boldsymbol{\beta}^{(1)}$, if $\beta_{k0}=0$, then $P(\beta_{k}^{(1)}=0)\rightarrow 1$. Therefore, similar to the previous proof,

$$\begin{aligned}
&\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(1)}-\beta_{k0}\big)}{n}\\
=&\Bigg\{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}x_{ij}}{n}-\frac{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}}{n}\cdot\frac{\sum_{s=1}^{n}D_{ss}^{(0)}x_{sj}}{n}}{\frac{\sum_{i=1}^{n}D_{ii}^{(0)}}{n}}\Bigg\}\beta_{j}^{(0)}\big(\beta_{k}^{(1)}-\beta_{k0}\big)\xrightarrow{P}0.\qquad (S1.4)
\end{aligned}$$

If $\beta_{k0}\neq 0$, then $\sqrt{n}\big(\beta_{k}^{(1)}-\beta_{k0}\big)\xrightarrow{D}N\big(0,[I^{-1}(\boldsymbol{\gamma}_{0})_{\mathcal{A}}]_{X_{k}}\big)$, where $I(\boldsymbol{\gamma}_{0})_{\mathcal{A}}$ is the Fisher information matrix knowing $(\boldsymbol{\gamma}_{0})_{\mathcal{A}^{c}}=\mathbf{0}$, and $[I^{-1}(\boldsymbol{\gamma}_{0})_{\mathcal{A}}]_{X_{k}}$ is the element of the matrix $I^{-1}(\boldsymbol{\gamma}_{0})_{\mathcal{A}}$ corresponding to $X_{k}$. Therefore, $\sqrt{n}\big(\beta_{k}^{(1)}-\beta_{k0}\big)=O_{p}(1)$. Furthermore,

$$\begin{aligned}
&\frac{\sum_{i=1}^{n}\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sk}\big)\big(\sum_{s=1}^{n}d_{is}^{(0)}x_{sj}\big)\beta_{j}^{(0)}\big(\beta_{k}^{(1)}-\beta_{k0}\big)}{n}\\
=&\Bigg\{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}x_{ij}}{n}-\frac{\frac{\sum_{i=1}^{n}x_{ik}D_{ii}^{(0)}}{n}\cdot\frac{\sum_{s=1}^{n}D_{ss}^{(0)}x_{sj}}{n}}{\frac{\sum_{i=1}^{n}D_{ii}^{(0)}}{n}}\Bigg\}\beta_{j}^{(0)}\,\frac{\sqrt{n}\big(\beta_{k}^{(1)}-\beta_{k0}\big)}{\sqrt{n}}=O_{p}\Big(\frac{1}{\sqrt{n}}\Big).\qquad (S1.5)
\end{aligned}$$

According to (S1.4) and (S1.5), (S1.3) is also $O_{p}(1/\sqrt{n})$.

In summary, the condition for selecting the covariate $X_j$ becomes
$$\left|\Big[E(D_{0,ii}x_{ij}^{2})-\frac{\{E(D_{0,ii}x_{ij})\}^{2}}{E(D_{0,ii})}\Big]\big(\beta_{j}^{(0)}\big)^{2}+o_{p}(1)\right|>\lambda.$$

Furthermore,

$$P\big(\beta_{j}^{(1)}\neq 0\big)\approx P\left(\Big[E(D_{0,ii}x_{ij}^{2})-\frac{\{E(D_{0,ii}x_{ij})\}^{2}}{E(D_{0,ii})}\Big]\big(\beta_{j}^{(0)}\big)^{2}>\lambda\right).\qquad (S1.6)$$

By the Central Limit Theorem, $\sqrt{n}\big(\beta_{j}^{(0)}-\beta_{j0}\big)\xrightarrow{D}N\big(0,\{I^{-1}(\boldsymbol{\gamma}_{0})\}_{j+1,j+1}\big)$ with $I(\boldsymbol{\gamma}_{0})=E(X^{\top}D_{0}X)/n$. Therefore, the right-hand side of (S1.6) can be approximated by

$$P_{d,j}^{*}=\Phi\left(\frac{-\sqrt{\dfrac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^{2})E(D_{0,ii})-\{E(D_{0,ii}x_{ij})\}^{2}}}+\beta_{j0}}{\sqrt{\{E(X^{\top}D_{0}X)\}^{-1}_{j+1,j+1}}}\right)+\Phi\left(\frac{-\sqrt{\dfrac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^{2})E(D_{0,ii})-\{E(D_{0,ii}x_{ij})\}^{2}}}-\beta_{j0}}{\sqrt{\{E(X^{\top}D_{0}X)\}^{-1}_{j+1,j+1}}}\right).\qquad (S1.7)$$
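For concreteness, the following is a minimal R sketch (not the authors' code) of the plug-in use of (S1.7) for a logistic model: the expectations are replaced by sample averages and $\beta_{j0}$, $\boldsymbol{\gamma}_{0}$ by their maximum likelihood estimates, in the spirit of the estimated selection probability (3.3) in the main paper. The function name and interface are illustrative assumptions.

```r
## A minimal sketch (not the authors' code): plug-in estimate of the
## approximated selection probability (S1.7) for a logistic model.
## Sample averages replace the expectations, the MLE replaces gamma_0,
## and 'lambda' is the tuning parameter of the one-step adaptive lasso.
est_selection_prob <- function(X, y, j, lambda) {
  fit <- glm(y ~ X, family = binomial())          # MLE of (alpha_0, beta_0)
  p_hat <- fit$fitted.values
  D <- p_hat * (1 - p_hat)                        # D_{0,ii} for the logistic model
  xj <- X[, j]
  ED <- mean(D); EDx <- mean(D * xj); EDx2 <- mean(D * xj^2)
  thr <- sqrt(lambda * ED / (EDx2 * ED - EDx^2))  # sqrt(lambda E(D) / {E(D x^2) E(D) - E(D x)^2})
  Xmat <- cbind(1, X)
  sd_j <- sqrt(solve(t(Xmat) %*% (D * Xmat))[j + 1, j + 1])  # {X^T D X}^{-1}_{j+1,j+1}
  beta_j <- coef(fit)[j + 1]
  pnorm((-thr + beta_j) / sd_j) + pnorm((-thr - beta_j) / sd_j)
}
```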


S2. Additional detailed analysis of the approximated selection probability in

finite samples

In this section, we provide an additional detailed analysis of the finite-sample properties of the approximated selection probability $P_{d,j}^{*}$, together with plots illustrating these properties under three different kinds of likelihood-based models.

S2.1 Symmetry of the approximated selection probability

To determine whether $P_{d,j}^{*}$ is a symmetric function of $\beta_{j0}$ when all other quantities in $P_{d,j}^{*}$ are held fixed, we need to check whether $P_{d,j}^{*}(\beta_{j0})=P_{d,j}^{*}(-\beta_{j0})$ for any $\beta_{j0}\neq 0$.

According to (S1.7),

$$\begin{aligned}P_{d,j}^{*}(\beta_{j0})=&\,\Phi\left(\frac{-\sqrt{\dfrac{\lambda E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}}{E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}^{2}\}E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}-[E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}\}]^{2}}}+\beta_{j0}}{\sqrt{\big[E\{X^{\top}D_{0}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})X\}\big]^{-1}_{j+1,j+1}}}\right)\\
&+\Phi\left(\frac{-\sqrt{\dfrac{\lambda E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}}{E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}^{2}\}E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}-[E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}\}]^{2}}}-\beta_{j0}}{\sqrt{\big[E\{X^{\top}D_{0}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})X\}\big]^{-1}_{j+1,j+1}}}\right)\end{aligned}$$
and
$$\begin{aligned}P_{d,j}^{*}(-\beta_{j0})=&\,\Phi\left(\frac{-\sqrt{\dfrac{\lambda E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}}{E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}^{2}\}E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}-[E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}\}]^{2}}}-\beta_{j0}}{\sqrt{\big[E\{X^{\top}D_{0}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})X\}\big]^{-1}_{j+1,j+1}}}\right)\\
&+\Phi\left(\frac{-\sqrt{\dfrac{\lambda E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}}{E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}^{2}\}E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}-[E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}\}]^{2}}}+\beta_{j0}}{\sqrt{\big[E\{X^{\top}D_{0}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})X\}\big]^{-1}_{j+1,j+1}}}\right).\end{aligned}$$


Since $D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})=-\partial^{2}\ell_{i}\{\mu_{i}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}/\partial\mu_{i}^{2}$ with $\mu_{i}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})=\alpha_{0}+\sum_{k\neq j}x_{ik}\beta_{k0}+x_{ij}\beta_{j0}$, and $D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})=-\partial^{2}\ell_{i}\{\mu_{i}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}/\partial\mu_{i}^{2}$ with $\mu_{i}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})=\alpha_{0}+\sum_{k\neq j}x_{ik}\beta_{k0}-x_{ij}\beta_{j0}$, one sufficient condition for $P_{d,j}^{*}(\beta_{j0})=P_{d,j}^{*}(-\beta_{j0})$ is that the distribution of $x_{ij}$ is symmetric about zero and $x_{ij}$ is independent of $x_{ik}$ for any $k\neq j$. Under this condition, we have $E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}=E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})\}$, $E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}^{2}\}=E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}^{2}\}$, $E\{D_{0,ii}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}\}=-E\{D_{0,ii}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})x_{ij}\}$ and $E\{X^{\top}D_{0}(\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})X\}=E\{X^{\top}D_{0}(-\beta_{j0},\boldsymbol{\gamma}_{0}^{-j})X\}$. It follows that $P_{d,j}^{*}(\beta_{j0})=P_{d,j}^{*}(-\beta_{j0})$.

However, this sufficient condition may not be satisfied in practice, and it is easy to find a case where $P_{d,j}^{*}(\beta_{j0})\neq P_{d,j}^{*}(-\beta_{j0})$. Therefore, given any fixed values of the other quantities in $P_{d,j}^{*}$, $P_{d,j}^{*}$ is not necessarily a symmetric function of $\beta_{j0}$. A numerical illustration is sketched below.
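The following minimal R sketch (illustrative assumptions only: the covariate distributions, the intercept, $\beta_{k0}$, $\lambda$ and $n$ are chosen arbitrarily) evaluates $P_{d,j}^{*}(\beta_{j0})$ and $P_{d,j}^{*}(-\beta_{j0})$ by Monte Carlo for a two-covariate logistic model with a skewed $x_{ij}$; the two values generally differ, illustrating that symmetry can fail when the sufficient condition does not hold.

```r
## A minimal sketch with illustrative assumptions (skewed x_ij, arbitrary
## parameter values): Monte Carlo evaluation of P*_{d,j} at beta_j0 and
## -beta_j0 for a two-covariate logistic model.
p_star_logistic <- function(beta_j, beta_k, alpha0, lambda, n, M = 1e5) {
  set.seed(1)
  xk <- rnorm(M)
  xj <- rexp(M) - 1                       # skewed covariate with mean zero
  p  <- plogis(alpha0 + xk * beta_k + xj * beta_j)
  D  <- p * (1 - p)                       # D_{0,ii}
  thr <- sqrt(lambda * mean(D) / (mean(D * xj^2) * mean(D) - mean(D * xj)^2))
  A  <- t(cbind(1, xk, xj)) %*% (D * cbind(1, xk, xj)) / M   # E(D_0 x_i x_i^T)
  sd_j <- sqrt(solve(A)[3, 3] / n)        # sqrt of {E(X^T D_0 X)}^{-1}_{j+1,j+1}
  pnorm((-thr + beta_j) / sd_j) + pnorm((-thr - beta_j) / sd_j)
}
p_star_logistic( 0.3, beta_k = 0.5, alpha0 = 0, lambda = 0.05, n = 350)
p_star_logistic(-0.3, beta_k = 0.5, alpha0 = 0, lambda = 0.05, n = 350)  # generally differs
```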

S2.2 Monotonicity of the approximated selection probability

To study the monotonicity of the approximated selection probability, we need to study the first-order derivative of $P_{d,j}^{*}$ with respect to $\beta_{j0}$. By calculation,
$$\frac{\partial P_{d,j}^{*}}{\partial\beta_{j0}}=\frac{1}{f_{2j}}\,\phi\!\left(\frac{-\sqrt{f_{1j}}-\beta_{j0}}{\sqrt{f_{2j}}}\right)\delta(\beta_{j0}),$$
where
$$f_{1j}=\frac{\lambda E(D_{0,ii})}{E(D_{0,ii}x_{ij}^{2})E(D_{0,ii})-\{E(D_{0,ii}x_{ij})\}^{2}},\qquad f_{2j}=\{E(X^{\top}D_{0}X)\}^{-1}_{j+1,j+1},$$


and
$$\begin{aligned}\delta(\beta_{j0})=&\left[\left\{-\frac{1}{2}(f_{1j})^{-\frac{1}{2}}\frac{\partial f_{1j}}{\partial\beta_{j0}}+1\right\}\sqrt{f_{2j}}-\frac{1}{2}(f_{2j})^{-\frac{1}{2}}\big(-\sqrt{f_{1j}}+\beta_{j0}\big)\frac{\partial f_{2j}}{\partial\beta_{j0}}\right]\exp\!\left(\frac{2\sqrt{f_{1j}}\,\beta_{j0}}{f_{2j}}\right)\\
&+\left\{-\frac{1}{2}(f_{1j})^{-\frac{1}{2}}\frac{\partial f_{1j}}{\partial\beta_{j0}}-1\right\}\sqrt{f_{2j}}+\frac{1}{2}(f_{2j})^{-\frac{1}{2}}\big(\sqrt{f_{1j}}+\beta_{j0}\big)\frac{\partial f_{2j}}{\partial\beta_{j0}},\end{aligned}$$
with
$$\frac{\partial f_{1j}}{\partial\beta_{j0}}=\frac{\lambda\,\frac{\partial E(D_{0,ii})}{\partial\beta_{j0}}\big[E(D_{0,ii}x_{ij}^{2})E(D_{0,ii})-\{E(D_{0,ii}x_{ij})\}^{2}\big]}{\big[E(D_{0,ii}x_{ij}^{2})E(D_{0,ii})-\{E(D_{0,ii}x_{ij})\}^{2}\big]^{2}}-\frac{\lambda E(D_{0,ii})\Big\{\frac{\partial E(D_{0,ii}x_{ij}^{2})}{\partial\beta_{j0}}E(D_{0,ii})+E(D_{0,ii}x_{ij}^{2})\frac{\partial E(D_{0,ii})}{\partial\beta_{j0}}-2E(D_{0,ii}x_{ij})\frac{\partial E(D_{0,ii}x_{ij})}{\partial\beta_{j0}}\Big\}}{\big[E(D_{0,ii}x_{ij}^{2})E(D_{0,ii})-\{E(D_{0,ii}x_{ij})\}^{2}\big]^{2}},$$
$$\frac{\partial f_{2j}}{\partial\beta_{j0}}=\Big[\{E(X^{\top}D_{0}X)\}^{-1}E(X^{\top}M_{0}X)\{E(X^{\top}D_{0}X)\}^{-1}\Big]_{j+1,j+1},$$
and
$$M_{0}=\mathrm{diag}\left\{\frac{\partial^{3}\ell_{1}\{\mu_{1}(\boldsymbol{\gamma}_{0})\}}{\partial\mu_{1}^{3}}x_{1j},\ldots,\frac{\partial^{3}\ell_{n}\{\mu_{n}(\boldsymbol{\gamma}_{0})\}}{\partial\mu_{n}^{3}}x_{nj}\right\}.$$

To simplify the proof, we first consider the case where $(\boldsymbol{x}_{i},y_{i})$ follows a logistic regression model, that is,
$$E(y_{i}\mid\boldsymbol{x}_{i})=p_{i}=\frac{\exp(\alpha_{0}+\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{0})}{1+\exp(\alpha_{0}+\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{0})}.$$

By calculation, $D_{0,ii}=p_{i}(1-p_{i})$ and $D_{0}=\mathrm{diag}\{p_{1}(1-p_{1}),\ldots,p_{n}(1-p_{n})\}$. Assume $p=2$, $x_{i1}$ and $x_{i2}$ are independent, $E(x_{ij})=0$ and $\mathrm{Var}(x_{ij})=1$, $j=1,2$. Denote $\exp(\alpha_{0}+x_{ik}\beta_{k0})$ as $t_{k}$, $k\neq j$. It is easy to show that $\delta(0)=0$ and

$$\left.\frac{\partial\delta(\beta_{j0})}{\partial\beta_{j0}}\right|_{\beta_{j0}=0}=\frac{\sqrt{\lambda}}{2}\cdot\frac{\Big[E\Big\{\frac{t_{k}(1-t_{k})x_{ik}}{(1+t_{k})^{3}}\Big\}E\Big\{\frac{t_{k}}{(1+t_{k})^{2}}\Big\}-E\Big\{\frac{t_{k}x_{ik}}{(1+t_{k})^{2}}\Big\}E\Big\{\frac{t_{k}(1-t_{k})}{(1+t_{k})^{3}}\Big\}\Big]^{2}}{\Big[E\Big\{\frac{t_{k}}{(1+t_{k})^{2}}\Big\}\Big]^{3}\Big[E\Big\{\frac{t_{k}x_{ik}^{2}}{(1+t_{k})^{2}}\Big\}E\Big\{\frac{t_{k}}{(1+t_{k})^{2}}\Big\}-\Big[E\Big\{\frac{t_{k}x_{ik}}{(1+t_{k})^{2}}\Big\}\Big]^{2}\Big]}+2\sqrt{n\lambda}>0.$$

Therefore,
$$\left.\frac{\partial P_{d,j}^{*}}{\partial\beta_{j0}}\right|_{\beta_{j0}=0}=0\quad\text{and}\quad\left.\frac{\partial^{2}P_{d,j}^{*}}{\partial\beta_{j0}^{2}}\right|_{\beta_{j0}=0}>0.$$

This means that $P_{d,j}^{*}$ attains a minimum at $\beta_{j0}=0$. Furthermore, there exist two positive constants $c_{1}$ and $c_{2}$ such that $\delta(\beta_{j0})\geq 0$ for any $\beta_{j0}\in[0,c_{1}]$ and $\delta(\beta_{j0})\leq 0$ for any $\beta_{j0}\in[-c_{2},0]$. Thus, $\partial P_{d,j}^{*}/\partial\beta_{j0}\geq 0$ for any $\beta_{j0}\in[0,c_{1}]$ and $\partial P_{d,j}^{*}/\partial\beta_{j0}\leq 0$ for any $\beta_{j0}\in[-c_{2},0]$. In other words, $P_{d,j}^{*}$ is an increasing function of $\beta_{j0}$ if $0<\beta_{j0}<c_{1}$ and a decreasing function of $\beta_{j0}$ if $-c_{2}<\beta_{j0}<0$.

Second, we consider the case where $(\boldsymbol{x}_{i},y_{i})$ follows a Poisson regression model, that is,
$$P(y_{i}=y\mid\boldsymbol{x}_{i})=\frac{\lambda_{i}^{y}}{y!}\exp(-\lambda_{i}),$$
where $\lambda_{i}=E(y_{i}\mid\boldsymbol{x}_{i})=\exp(\alpha_{0}+\boldsymbol{x}_{i}^{\top}\boldsymbol{\beta}_{0})$. By calculation, $D_{0,ii}=\lambda_{i}$ and $D_{0}=\mathrm{diag}\{\lambda_{1},\ldots,\lambda_{n}\}$. Assume $p=2$, $x_{i1}$ and $x_{i2}$ are independent, $E(x_{ij})=0$ and $\mathrm{Var}(x_{ij})=1$, $j=1,2$. Denote $\exp(\alpha_{0}+x_{ik}\beta_{k0})$ as $t_{k}$, $k\neq j$. Then

$$\frac{\partial P_{d,j}^{*}}{\partial\beta_{j0}}=\frac{n\lambda}{f_{1j}}\,\phi\!\left(-\sqrt{n\lambda}-\beta_{j0}\sqrt{\frac{n\lambda}{f_{1j}}}\right)\delta(\beta_{j0}),$$
with
$$\delta(\beta_{j0})=\left(\sqrt{\frac{f_{1j}}{n\lambda}}-\frac{\beta_{j0}}{2\sqrt{n\lambda f_{1j}}}\frac{\partial f_{1j}}{\partial\beta_{j0}}\right)\left[\exp\!\left(\frac{2\beta_{j0}n\lambda}{\sqrt{f_{1j}}}\right)-1\right],$$
$$f_{1j}=\frac{\lambda E\{\exp(x_{ij}\beta_{j0})\}}{E(t_{k})\Big[E\{\exp(x_{ij}\beta_{j0})x_{ij}^{2}\}E\{\exp(x_{ij}\beta_{j0})\}-[E\{\exp(x_{ij}\beta_{j0})x_{ij}\}]^{2}\Big]},$$
and
$$\frac{\partial f_{1j}}{\partial\beta_{j0}}=\frac{2\lambda E\{\exp(x_{ij}\beta_{j0})\}E\{\exp(x_{ij}\beta_{j0})x_{ij}\}E\{\exp(x_{ij}\beta_{j0})x_{ij}^{2}\}-\lambda[E\{\exp(x_{ij}\beta_{j0})x_{ij}\}]^{3}-\lambda[E\{\exp(x_{ij}\beta_{j0})\}]^{2}E\{\exp(x_{ij}\beta_{j0})x_{ij}^{3}\}}{E(t_{k})\Big[E\{\exp(x_{ij}\beta_{j0})x_{ij}^{2}\}E\{\exp(x_{ij}\beta_{j0})\}-[E\{\exp(x_{ij}\beta_{j0})x_{ij}\}]^{2}\Big]^{2}}.$$

In particular, if $x_{ij}$ follows the standard normal distribution, then
$$\begin{aligned}\frac{\partial P_{d,j}^{*}}{\partial\beta_{j0}}=&\,nE(t_{k})\exp(\beta_{j0}^{2}/2)\,\phi\!\left[-\sqrt{n\lambda}-\beta_{j0}\sqrt{nE(t_{k})\exp(\beta_{j0}^{2}/2)}\right]\\
&\times\left[\sqrt{\frac{1}{nE(t_{k})\exp(\beta_{j0}^{2}/2)}}+\frac{\beta_{j0}^{2}}{2\sqrt{nE(t_{k})\exp(\beta_{j0}^{2}/2)}}\right]\times\left[\exp\left\{2\beta_{j0}n\sqrt{\lambda E(t_{k})\exp(\beta_{j0}^{2}/2)}\right\}-1\right].\end{aligned}$$

Obviously, $\partial P_{d,j}^{*}/\partial\beta_{j0}>0$ if $\beta_{j0}>0$, $\partial P_{d,j}^{*}/\partial\beta_{j0}=0$ if $\beta_{j0}=0$, and $\partial P_{d,j}^{*}/\partial\beta_{j0}<0$ if $\beta_{j0}<0$. Thus, $P_{d,j}^{*}$ is an increasing function of $\beta_{j0}$ if $\beta_{j0}>0$ and a decreasing function of $\beta_{j0}$ if $\beta_{j0}<0$.
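The closed form above is easy to evaluate numerically. The following minimal R sketch (with arbitrary illustrative values for $n$, $\lambda$ and $E(t_{k})$) plots $P_{d,j}^{*}$ as a function of $\beta_{j0}$ for the Poisson case with standard normal $x_{ij}$, using $f_{1j}=\lambda/\{E(t_{k})\exp(\beta_{j0}^{2}/2)\}$ and $f_{2j}=f_{1j}/(n\lambda)$, which follow from the displayed $f_{1j}$ and $f_{2j}$ when $x_{ij}\sim N(0,1)$; the curve is minimized at $\beta_{j0}=0$ and increases in $|\beta_{j0}|$, in line with the monotonicity result.

```r
## A minimal sketch with arbitrary illustrative values: P*_{d,j} for the
## Poisson case with x_ij ~ N(0,1), where f_1j = lambda/{E(t_k) exp(beta^2/2)}
## and f_2j = f_1j/(n*lambda), so that 1/sqrt(f_2j) = sqrt(n E(t_k) exp(beta^2/2)).
p_star_poisson <- function(beta, n, lambda, Etk) {
  a <- sqrt(n * Etk * exp(beta^2 / 2))    # 1 / sqrt(f_{2j})
  pnorm(-sqrt(n * lambda) + beta * a) + pnorm(-sqrt(n * lambda) - beta * a)
}
beta_grid <- seq(-1, 1, by = 0.01)
p_vals <- p_star_poisson(beta_grid, n = 350, lambda = 0.05, Etk = 1.5)
plot(beta_grid, p_vals, type = "l",
     xlab = "beta_j0", ylab = "approximated selection probability")
# The curve attains its minimum at beta_j0 = 0 and is monotone on each side.
```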


S3. Proof for Theorem 1

According to (2.4) in the main paper, the objective function in $\boldsymbol{\beta}$ for the one-step adaptive lasso estimator is
$$Q(\boldsymbol{\beta})=\frac{1}{2n}\big(\boldsymbol{\beta}-\boldsymbol{\beta}^{(0)}\big)^{\top}X^{\top}D^{\dagger(0)}X\big(\boldsymbol{\beta}-\boldsymbol{\beta}^{(0)}\big)+\sum_{j=1}^{p}\frac{\lambda|\beta_{j}|}{|\beta_{j}^{(0)}|}.$$

For $\beta_{j}\approx\beta_{j}^{(1)}$, $Q(\boldsymbol{\beta})$ can be approximated by

$$\begin{aligned}&\frac{1}{2n}\big(\boldsymbol{\beta}-\boldsymbol{\beta}^{(0)}\big)^{\top}X^{\top}D^{\dagger(0)}X\big(\boldsymbol{\beta}-\boldsymbol{\beta}^{(0)}\big)+\sum_{j=1}^{p}\frac{\lambda|\beta_{j}^{(1)}|}{|\beta_{j}^{(0)}|}+\frac{1}{2}\sum_{j=1}^{p}\frac{\lambda}{|\beta_{j}^{(0)}||\beta_{j}^{(1)}|}\big\{\beta_{j}^{2}-(\beta_{j}^{(1)})^{2}\big\}\\
=&\,L(\boldsymbol{\beta})+\sum_{j=1}^{p}\frac{\lambda|\beta_{j}^{(1)}|}{|\beta_{j}^{(0)}|}+\frac{1}{2}\sum_{j=1}^{p}\frac{\lambda}{|\beta_{j}^{(0)}||\beta_{j}^{(1)}|}\big\{\beta_{j}^{2}-(\beta_{j}^{(1)})^{2}\big\},\end{aligned}$$
where $L(\boldsymbol{\beta})=\big(\boldsymbol{\beta}-\boldsymbol{\beta}^{(0)}\big)^{\top}X^{\top}D^{\dagger(0)}X\big(\boldsymbol{\beta}-\boldsymbol{\beta}^{(0)}\big)/(2n)$.

It can be shown easily that there exists a $\boldsymbol{\beta}^{(1)}_{\mathcal{A}}$ that is a $\sqrt{n}$-consistent local minimizer of $Q\{(\boldsymbol{\beta}_{\mathcal{A}}^{\top},\mathbf{0}_{\mathcal{A}^{c}}^{\top})^{\top}\}$ and satisfies the following condition:
$$\left.\frac{\partial Q(\boldsymbol{\beta})}{\partial\beta_{j}}\right|_{\boldsymbol{\beta}=(\boldsymbol{\beta}^{(1)\top}_{\mathcal{A}},\,\mathbf{0}_{\mathcal{A}^{c}}^{\top})^{\top}}=0\quad\text{for } j=1,\ldots,q,$$
where $\mathcal{A}=\{j:\beta_{j0}\neq 0,\,j=1,\ldots,p\}$ and $\mathcal{A}^{c}=\{j:\beta_{j0}=0,\,j=1,\ldots,p\}$. Without loss of generality, assume $\mathcal{A}=\{1,\ldots,q\}$ and $q\leq p$.


Note that $\boldsymbol{\beta}^{(1)}_{\mathcal{A}}$ is a consistent estimator; then
$$\begin{aligned}&\left.\frac{\partial L(\boldsymbol{\beta})}{\partial\beta_{j}}\right|_{\boldsymbol{\beta}=(\boldsymbol{\beta}^{(1)\top}_{\mathcal{A}},\,\mathbf{0}_{\mathcal{A}^{c}}^{\top})^{\top}}+\frac{\lambda}{|\beta_{j}^{(0)}||\beta_{j}^{(1)}|}\beta_{j}^{(1)}
=\left.\frac{\partial L(\boldsymbol{\beta})}{\partial\beta_{j}}\right|_{\boldsymbol{\beta}=(\boldsymbol{\beta}^{(1)\top}_{\mathcal{A}},\,\mathbf{0}_{\mathcal{A}^{c}}^{\top})^{\top}}+\frac{\lambda}{|\beta_{j}^{(0)}|}\mathrm{sgn}(\beta_{j}^{(1)})\\
&=\frac{\partial L(\boldsymbol{\beta}_{0})}{\partial\beta_{j}}+\sum_{\ell=1}^{q}\left\{\frac{\partial^{2}L(\boldsymbol{\beta}_{0})}{\partial\beta_{j}\partial\beta_{\ell}}+o_{p}(1)\right\}\big(\beta_{\ell}^{(1)}-\beta_{\ell 0}\big)+\frac{\lambda}{|\beta_{j}^{(0)}|}\mathrm{sgn}(\beta_{j0})+\frac{\lambda}{|\beta_{j}^{(0)}||\beta_{j}^{(1)}|}\big(\beta_{j}^{(1)}-\beta_{j0}\big)=0.\qquad (S3.1)\end{aligned}$$

Denote $X^{\top}D^{\dagger(0)}X$ as $Z^{(0)}$; then, according to (S3.1),
$$\begin{aligned}&\sqrt{n}\left\{\frac{1}{n}Z^{(0)}_{\mathcal{A}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)\right\}\left[\boldsymbol{\beta}^{(1)}_{\mathcal{A}}-\boldsymbol{\beta}_{0,\mathcal{A}}+\left\{\frac{1}{n}Z^{(0)}_{\mathcal{A}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)\right\}^{-1}b\big(\boldsymbol{\beta}_{0,\mathcal{A}},\boldsymbol{\beta}^{(0)}_{\mathcal{A}}\big)\right]\\
&=-\sqrt{n}\,\frac{\partial L(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}_{\mathcal{A}}}=\frac{1}{\sqrt{n}}Z^{(0)}_{\mathcal{A}}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}}-\boldsymbol{\beta}_{0,\mathcal{A}}\big),\qquad (S3.2)\end{aligned}$$

where $\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)=\mathrm{diag}\big\{\lambda/(|\beta_{1}^{(0)}||\beta_{1}^{(1)}|),\ldots,\lambda/(|\beta_{q}^{(0)}||\beta_{q}^{(1)}|)\big\}$ and $b\big(\boldsymbol{\beta}_{0,\mathcal{A}},\boldsymbol{\beta}^{(0)}_{\mathcal{A}}\big)=\big(\lambda\,\mathrm{sgn}(\beta_{10})/|\beta_{1}^{(0)}|,\ldots,\lambda\,\mathrm{sgn}(\beta_{q0})/|\beta_{q}^{(0)}|\big)^{\top}$. According to the Central Limit Theorem, $\sqrt{n}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}}-\boldsymbol{\beta}_{0,\mathcal{A}}\big)\xrightarrow{D}N\big(\mathbf{0},\{(I_{0,\mathcal{B}})^{-1}\}_{\mathcal{A}}\big)$, where $\mathcal{B}=\{k:\gamma_{k0}\neq 0,\,k=1,\ldots,p+1\}$. Furthermore, according to Slutsky's Theorem, the asymptotic bias of $\boldsymbol{\beta}^{(1)}_{\mathcal{A}}$ is

$$\mathrm{bias}\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)=-\left\{\frac{1}{n}Z_{0,\mathcal{A}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}_{0,\mathcal{A}},\boldsymbol{\beta}_{0,\mathcal{A}}\big)\right\}^{-1}b\big(\boldsymbol{\beta}_{0,\mathcal{A}},\boldsymbol{\beta}_{0,\mathcal{A}}\big),$$


where $Z_{0}=E\big(X^{\top}D_{0}^{\dagger}X\big)$. The asymptotic covariance matrix of $\boldsymbol{\beta}^{(1)}_{\mathcal{A}}$ is
$$\mathrm{cov}\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)=\frac{1}{n^{3}}\left\{\frac{1}{n}Z_{0,\mathcal{A}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}_{0,\mathcal{A}},\boldsymbol{\beta}_{0,\mathcal{A}}\big)\right\}^{-1}Z_{0,\mathcal{A}}\{(I_{0,\mathcal{B}})^{-1}\}_{\mathcal{A}}Z_{0,\mathcal{A}}\left\{\frac{1}{n}Z_{0,\mathcal{A}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}_{0,\mathcal{A}},\boldsymbol{\beta}_{0,\mathcal{A}}\big)\right\}^{-1}.$$

If $\lambda\rightarrow 0$ as $n$ goes to infinity, then $\mathrm{bias}\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)\rightarrow\mathbf{0}$ and $n\,\mathrm{cov}\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}}\big)\rightarrow\{(I_{0,\mathcal{B}})^{-1}\}_{\mathcal{A}}$.

If $n$ is finite, then the bias of $\boldsymbol{\beta}^{(1)}_{\mathcal{A}}$ cannot be ignored and $\mathcal{A}_{n}$ is not necessarily equal to $\mathcal{A}$. Without loss of generality, assume $\mathcal{A}_{n}=\{j:\beta_{j}^{(1)}\neq 0,\,j=1,\ldots,p\}=\{1,\ldots,s\}$. Then $\mathcal{B}_{n}=\{k:\gamma_{k}^{(1)}\neq 0,\,k=1,\ldots,p+1\}=\{1,\ldots,s+1\}$. Furthermore, the estimators of the bias and the covariance matrix of $\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}$ are given by

$$\mathrm{bias}\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}\big)=-\left\{\frac{1}{n}Z^{(0)}_{\mathcal{A}_{n}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}_{n}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}\big)\right\}^{-1}b\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}},\boldsymbol{\beta}^{(0)}_{\mathcal{A}_{n}}\big)$$

and
$$\mathrm{cov}\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}\big)=\frac{1}{n^{3}}\left\{\frac{1}{n}Z^{(0)}_{\mathcal{A}_{n}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}_{n}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}\big)\right\}^{-1}Z^{(0)}_{\mathcal{A}_{n}}\big\{(I^{(0)}_{\mathcal{B}_{n}})^{-1}\big\}_{\mathcal{A}_{n}}Z^{(0)}_{\mathcal{A}_{n}}\left\{\frac{1}{n}Z^{(0)}_{\mathcal{A}_{n}}+\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}_{n}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}\big)\right\}^{-1},$$

where $\Sigma_{\lambda}\big(\boldsymbol{\beta}^{(0)}_{\mathcal{A}_{n}},\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}}\big)=\mathrm{diag}\big\{\lambda/(|\beta_{1}^{(0)}||\beta_{1}^{(1)}|),\ldots,\lambda/(|\beta_{s}^{(0)}||\beta_{s}^{(1)}|)\big\}$ and $b\big(\boldsymbol{\beta}^{(1)}_{\mathcal{A}_{n}},\boldsymbol{\beta}^{(0)}_{\mathcal{A}_{n}}\big)=\big(\lambda\,\mathrm{sgn}(\beta_{1}^{(1)})/|\beta_{1}^{(0)}|,\ldots,\lambda\,\mathrm{sgn}(\beta_{s}^{(1)})/|\beta_{s}^{(0)}|\big)^{\top}$.
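As an illustration, the following minimal R sketch (not the authors' implementation; all inputs are assumed to be available from the fitted model) evaluates the plug-in bias and covariance estimators displayed above for the selected set $\mathcal{A}_{n}=\{1,\ldots,s\}$.

```r
## A minimal sketch (assumed inputs, not the authors' code) of the plug-in
## bias and covariance estimators above.
## Z0_An:        the A_n block of Z^(0) = X^T D^dagger(0) X
## I0_Bn_inv_An: the A_n block of the inverse information matrix for B_n
## beta0_An, beta1_An: initial (MLE) and one-step adaptive lasso estimates on A_n
onestep_bias_cov <- function(Z0_An, I0_Bn_inv_An, beta0_An, beta1_An, lambda, n) {
  Sigma_lambda <- diag(lambda / (abs(beta0_An) * abs(beta1_An)),
                       nrow = length(beta0_An))
  b_vec <- lambda * sign(beta1_An) / abs(beta0_An)
  M_inv <- solve(Z0_An / n + Sigma_lambda)
  bias_hat <- -M_inv %*% b_vec
  cov_hat  <- M_inv %*% Z0_An %*% I0_Bn_inv_An %*% Z0_An %*% M_inv / n^3
  list(bias = drop(bias_hat), cov = cov_hat)
}
```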


S4. Implementation details of several methods

In this section, we introduce the implementation details of several methods mentioned in the

main paper.

S4.1 One-step adaptive lasso estimator

To obtain the one-step adaptive lasso estimator, we use the function glmnet in R to solve

(2.5). The selection of tuning parameter λ is important. In finite samples, if λ is too large,

the bias of the one-step adaptive lasso estimator will be large and the coverage probability of

the confidence interval constructed based on the asymptotic theory for the one-step adaptive

lasso estimator will be low; if λ is too small, the number of false positives will be large and

the width of the confidence interval will also be large. The Bayesian information criterion

(BIC) and the cross-validation (CV) method are two commonly used tuning parameter selection methods. In our simulations, the value of $\lambda$ selected by the BIC proposed by Wang and Leng (2007) is much larger than the value of $\lambda$ selected by the 5-fold cross-validation method. Denote the values of $\lambda$ selected by these two methods as $\lambda_{\mathrm{BIC}}$ and $\lambda_{\mathrm{CV}}$, respectively. We choose $\lambda$ to be $(\lambda_{\mathrm{BIC}}+\lambda_{\mathrm{CV}})/2$ as a trade-off between the two methods. A small code sketch of this tuning strategy is given below.
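The following is a minimal R sketch of this tuning strategy (not the authors' code). It fits an adaptive lasso by supplying the weights $1/|\beta_{j}^{(0)}|$ through the penalty.factor argument of glmnet and penalizes the logistic likelihood directly, rather than solving the least-squares approximation (2.5); it also uses deviance $+\log(n)\times$ df as a simple stand-in for the BIC of Wang and Leng (2007). It is meant only to illustrate the $\lambda=(\lambda_{\mathrm{BIC}}+\lambda_{\mathrm{CV}})/2$ rule.

```r
## A minimal sketch (assumptions noted above, not the authors' code) of the
## adaptive-lasso weighting and the lambda = (lambda_BIC + lambda_CV)/2 rule.
library(glmnet)

one_step_alasso <- function(X, y) {
  n <- nrow(X)
  beta0 <- coef(glm(y ~ X, family = binomial()))[-1]   # initial MLE of beta_0
  w <- 1 / abs(beta0)                                   # adaptive weights 1/|beta_j^(0)|
  fit <- glmnet(X, y, family = "binomial", penalty.factor = w)
  bic <- deviance(fit) + log(n) * fit$df                # simple BIC stand-in
  lambda_bic <- fit$lambda[which.min(bic)]
  lambda_cv  <- cv.glmnet(X, y, family = "binomial", penalty.factor = w,
                          nfolds = 5)$lambda.min
  lambda_use <- (lambda_bic + lambda_cv) / 2            # trade-off between BIC and CV
  coef(fit, s = lambda_use)
}
```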

S4.2 Estimating equation-based method

In our simulation studies and real-data application, we compare the proposed method with an estimating equation-based method, proposed by Neykov et al. (2018) and denoted as "EstEq." We apply their method based on Algorithm 1 in their paper. Using the same notation as in our paper, the implementation details are as follows:


Step 1: Use the R functions gds and cv.gds to obtain the generalized Dantzig selector of the regression coefficient $\boldsymbol{\gamma}_{0}=(\alpha_{0},\boldsymbol{\beta}_{0}^{\top})^{\top}$ in a logistic regression model, and denote the estimator as $\widehat{\boldsymbol{\gamma}}$. That is, solve the following optimization problem:
$$\widehat{\boldsymbol{\gamma}}=\arg\min\|\boldsymbol{\gamma}\|_{1},\quad\text{subject to}\quad\|t(\boldsymbol{\gamma})\|_{\infty}=\left\|-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial\ell_{i}(\boldsymbol{\gamma})}{\partial\boldsymbol{\gamma}}\right\|_{\infty}=\left\|-\frac{1}{n}\sum_{i=1}^{n}\{y_{i}-p_{i}(\boldsymbol{\gamma})\}\boldsymbol{x}_{i}\right\|_{\infty}\leq\lambda,$$
where $\ell_{i}(\boldsymbol{\gamma})$ is the conditional log-likelihood function of $y_{i}$ given $\boldsymbol{x}_{i}$ for a logistic regression model and $p_{i}(\boldsymbol{\gamma})=\exp(\boldsymbol{x}_{i}^{\top}\boldsymbol{\gamma})/\{1+\exp(\boldsymbol{x}_{i}^{\top}\boldsymbol{\gamma})\}$, $i=1,\ldots,n$. The tuning parameter of the generalized Dantzig selector, $\lambda$, is selected by the 10-fold cross-validation method.

Step 2: Calculate the inverse of $T(\widehat{\boldsymbol{\gamma}})=\partial t(\widehat{\boldsymbol{\gamma}})/\partial\boldsymbol{\gamma}^{\top}=X^{\top}D(\widehat{\boldsymbol{\gamma}})X/n$, where $D(\widehat{\boldsymbol{\gamma}})=\mathrm{diag}\{p_{1}(\widehat{\boldsymbol{\gamma}})(1-p_{1}(\widehat{\boldsymbol{\gamma}})),\ldots,p_{n}(\widehat{\boldsymbol{\gamma}})(1-p_{n}(\widehat{\boldsymbol{\gamma}}))\}$. Denote the inverse of $T(\widehat{\boldsymbol{\gamma}})$ as $\Omega$. Define the projection direction for the $j$th element of $\boldsymbol{\beta}_{0}$, $\beta_{j0}$, as $\boldsymbol{v}_{j}=\Omega_{(j+1)\cdot}$, where $\Omega_{(j+1)\cdot}$ is the $(j+1)$th row of $\Omega$. Note that in Neykov et al. (2018), the authors used the CLIME estimator to estimate the inverse of $T(\widehat{\boldsymbol{\gamma}})$. However, in our problem, we assume $n>p$ and $p$ is fixed, so the inverse of $T(\widehat{\boldsymbol{\gamma}})$ can be calculated directly.

Step 3: Use the R function uniroot to solve the sparse projected test function, and denote the estimated value of $\beta_{j0}$ as $\widehat{\beta}_{j}$.

Step 4: Construct a two-sided $100(1-\alpha)\%$ confidence interval for $\beta_{j0}$ as
$$\mathrm{CI}_{j}=\big(\widehat{\beta}_{j}-\Phi^{-1}(1-\alpha/2)\,\widehat{\sigma}_{j}/\sqrt{n},\ \widehat{\beta}_{j}+\Phi^{-1}(1-\alpha/2)\,\widehat{\sigma}_{j}/\sqrt{n}\big),$$
where $\widehat{\sigma}_{j}^{2}=\boldsymbol{v}_{j}^{\top}X^{\top}D(\widehat{\boldsymbol{\gamma}})X\boldsymbol{v}_{j}/n$. A small R sketch of Steps 2 and 4 is given below.
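The following minimal R sketch (not the code of Neykov et al. (2018)) implements Steps 2 and 4: given the estimate $\widehat{\boldsymbol{\gamma}}$ from Step 1 and the estimate $\widehat{\beta}_{j}$ from Step 3, both assumed to be available, it forms the projection direction and the confidence interval.

```r
## A minimal sketch of Steps 2 and 4 (gamma_hat and beta_j_hat are assumed to
## come from Steps 1 and 3, which are not reproduced here).
## X1 is the n x (p+1) design matrix whose first column is the intercept.
estEq_ci <- function(X1, gamma_hat, beta_j_hat, j, alpha = 0.05) {
  n <- nrow(X1)
  p_hat <- plogis(drop(X1 %*% gamma_hat))
  D <- p_hat * (1 - p_hat)
  T_hat <- t(X1) %*% (D * X1) / n          # T(gamma_hat) = X^T D(gamma_hat) X / n
  Omega <- solve(T_hat)                    # direct inverse, since n > p and p is fixed
  v_j <- Omega[j + 1, ]                    # projection direction for beta_{j0}
  sigma_j <- sqrt(drop(t(v_j) %*% T_hat %*% v_j))  # sigma_j^2 = v_j^T X^T D X v_j / n
  z <- qnorm(1 - alpha / 2)
  c(lower = beta_j_hat - z * sigma_j / sqrt(n),
    upper = beta_j_hat + z * sigma_j / sqrt(n))
}
```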

S4.3 Two types of bootstrap de-biased lasso methods

Motivated by the idea of Dezeure et al. (2017), we establish two xy-paired bootstrap de-

biased lasso methods, which are referred to as “the type-I bootstrap de-biased lasso method”

and “the type-II bootstrap de-biased lasso method,” respectively. The bootstrap de-biased

lasso method is based on the de-biased lasso method proposed by Zhang and Zhang (2014),

Van de Geer et al. (2014) and Javanmard and Montanari (2014). Following the idea of

Dezeure et al. (2017), the procedure for the type-I bootstrap de-biased lasso method is as

follows:

(i) Based on the original data points (X1, Y1), . . . , (Xn, Yn), calculate the lasso estimator

and de-biased lasso estimator of the jth element of β0, βj0. Denote them as bj and βj ,

respectively. Calculate the standard error of the de-biased lasso estimator, s.e.j .

(ii) Resample $(X_{1}^{*},Y_{1}^{*}),\ldots,(X_{n}^{*},Y_{n}^{*})$ with replacement from $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ for $B$ times. For the $k$th bootstrap sample, calculate the de-biased lasso estimator $b_{jk}^{*}$, the standard error of the de-biased lasso estimator $\mathrm{s.e.}_{jk}^{*}$, and $T_{jk}^{*}=(b_{jk}^{*}-\beta_{j})/\mathrm{s.e.}_{jk}^{*}$. Denote the $\nu$-quantile of $\{T_{j1}^{*},\ldots,T_{jB}^{*}\}$ as $q_{j;\nu}^{*}$.

(iii) Construct a two-sided $100(1-\alpha)\%$ confidence interval for $\beta_{j0}$ as
$$\mathrm{CI}_{j}=\big(b_{j}-q_{j;1-\alpha/2}^{*}\,\mathrm{s.e.}_{j},\ b_{j}-q_{j;\alpha/2}^{*}\,\mathrm{s.e.}_{j}\big).$$
A small R sketch of this procedure is given below.
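A minimal R sketch of the type-I procedure follows. The helper debiased_lasso() is a hypothetical placeholder assumed to return the lasso estimate (b), the de-biased lasso estimate (bd) and its standard error (se) for the $j$th coefficient; it is not a function from any package cited here.

```r
## A minimal sketch of the type-I bootstrap de-biased lasso interval.
## debiased_lasso() is a hypothetical helper assumed to return
## list(b = lasso estimate, bd = de-biased estimate, se = its standard error).
bs_debiased_ci_type1 <- function(X, y, j, B = 500, alpha = 0.05) {
  orig <- debiased_lasso(X, y, j)
  n <- nrow(X)
  T_star <- replicate(B, {
    idx <- sample.int(n, n, replace = TRUE)               # xy-paired resampling
    bs <- debiased_lasso(X[idx, , drop = FALSE], y[idx], j)
    (bs$bd - orig$bd) / bs$se                             # T*_{jk} = (b*_{jk} - beta_j)/s.e.*_{jk}
  })
  q <- quantile(T_star, c(alpha / 2, 1 - alpha / 2))
  c(lower = orig$b - q[2] * orig$se,                      # b_j - q*_{j;1-alpha/2} s.e._j
    upper = orig$b - q[1] * orig$se)                      # b_j - q*_{j;alpha/2}   s.e._j
}
```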

In addition, the procedure for the type-II bootstrap de-biased lasso method is as follows:


(i) Resample $(X_{1}^{*},Y_{1}^{*}),\ldots,(X_{n}^{*},Y_{n}^{*})$ with replacement from $(X_{1},Y_{1}),\ldots,(X_{n},Y_{n})$ for $B$ times. For the $k$th bootstrap sample, calculate the de-biased lasso estimator of the $j$th element of $\boldsymbol{\beta}_{0}$, $\beta_{j0}$, which is denoted as $b_{jk}^{*}$. Denote the $\nu$-quantile of $\{b_{j1}^{*},\ldots,b_{jB}^{*}\}$ as $q_{j;\nu}^{*}$.

(ii) Construct a two-sided $100(1-\alpha)\%$ confidence interval for $\beta_{j0}$ as
$$\mathrm{CI}_{j}=\big(q_{j;\alpha/2}^{*},\ q_{j;1-\alpha/2}^{*}\big).$$

S5. Additional simulation results

In this section, we present additional simulation results under the simulation settings in

Section 5. Figures S1 and S2 display the results for different types of selection probability for

X4 when ρ = 0.2 and 0.5, respectively. Figures S3 and S4 present the empirical probabilities

of assigning the covariate X4 to different signal categories as the value of θ varies when

ρ = 0.2 and 0.5, respectively. Tables S1–S4 show the coverage probabilities and average

widths of the 95% confidence intervals under all simulation settings. Figures S5–S7 show

the simulation results for the proposed method when the threshold value δ1 varies. Figures

S8–S10 show the simulation results for the proposed method when the threshold value τ

varies. Figures S11–S13 show the simulation results for the proposed method when the total

number of weak signals varies.

S6. Additional information in real-data application

Table S5 shows the candidate predictors used in the real-data application.


[Figure S1 near here: four panels for $(n,p)=(350,25)$, $(350,35)$, $(550,25)$ and $(550,35)$ with $\rho=0.2$; the horizontal axis is $\theta$ and the vertical axis is $P_d$.]

Figure S1: Different types of selection probability for $X_4$ when $\rho=0.2$. Pdem: empirical selection probability, which equals the empirical probability of $\theta^{(1)}\neq 0$ based on 500 Monte Carlo samples; Pdapproxi: approximated selection probability based on (3.1), where the expectations in (3.1) are calculated by using the function cubintegrate in R; Pdest: median of the estimated selection probabilities based on (3.3) for the 500 Monte Carlo samples.


[Figure S2 near here: four panels for $(n,p)=(350,25)$, $(350,35)$, $(550,25)$ and $(550,35)$ with $\rho=0.5$; the horizontal axis is $\theta$ and the vertical axis is $P_d$.]

Figure S2: Different types of selection probability for $X_4$ when $\rho=0.5$. For the meanings of the notation, see Figure S1.


[Figure S3 near here: four panels for $(n,p)=(350,25)$, $(350,35)$, $(550,25)$ and $(550,35)$ with $\rho=0.2$, plotting $P(4\in S^{(S)})$, $P(4\in S^{(W)})$ and $P(4\in S^{(N)})$ against $\theta$.]

Figure S3: Empirical probabilities of assigning the covariate $X_4$ to different signal categories when $\rho=0.2$.


[Figure S4 near here: four panels for $(n,p)=(350,25)$, $(350,35)$, $(550,25)$ and $(550,35)$ with $\rho=0.5$, plotting $P(4\in S^{(S)})$, $P(4\in S^{(W)})$ and $P(4\in S^{(N)})$ against $\theta$.]

Figure S4: Empirical probabilities of assigning the covariate $X_4$ to different signal categories when $\rho=0.5$.


Table S1: The coverage probabilities (%) of the 95% confidence intervals when the sample size is n = 350.

                            p = 25                        p = 35
θ      Method        ρ = 0   ρ = 0.2  ρ = 0.5     ρ = 0   ρ = 0.2  ρ = 0.5
0      Proposed       93.8     94.4     96.2       94.6     92.2     94.8
       OldTwostep     75.8     76.7     81.4       77.1     66.9     72.3
       Asym            3.6      3.8     12.7        4.3      1.4      4.0
       MLE            93.8     94.4     96.2       94.6     92.2     94.8
       Perturb       100.0    100.0    100.0      100.0    100.0    100.0
       EstEq          94.0     94.2     96.6       95.6     92.8     94.8
       SdBS           99.8    100.0     99.8       99.8     99.8     99.0
       SmBS          100.0    100.0     99.8      100.0    100.0    100.0
       DeLasso        95.8     96.0     98.2       96.4     95.2     96.4
       BSDe1          99.8    100.0     99.8      100.0    100.0    100.0
       BSDe2          94.8     94.4     96.2       95.4     91.8     94.4
0.3    Proposed       94.6     95.2     92.8       95.2     96.4     94.6
       OldTwostep     96.9     96.6     92.0       98.0     96.7     92.4
       Asym           75.5     71.6     61.5       65.8     69.6     69.3
       MLE            92.2     93.4     92.6       92.4     92.0     93.6
       Perturb        57.0     55.0     52.0       38.8     49.0     44.0
       EstEq          92.2     92.6     93.8       92.6     91.6     94.2
       SdBS           72.0     69.6     62.8       53.0     61.0     53.4
       SmBS           65.2     64.6     59.8       39.8     49.4     47.8
       DeLasso        93.8     94.0     92.8       93.0     93.4     95.0
       BSDe1          52.0     58.0     85.6       48.6     60.6     86.4
       BSDe2          94.2     94.6     95.0       96.2     95.0     95.2
0.95   Proposed       95.0     93.6     95.0       96.0     93.8     97.2
       OldTwostep     95.0     93.6     95.4       96.0     93.8     97.2
       Asym           95.0     93.6     91.6       96.0     93.8     92.2
       MLE            90.0     91.6     91.2       87.8     87.8     86.8
       Perturb        93.2     93.0     97.0       95.4     94.2     96.4
       EstEq          90.6     87.4     92.8       89.8     89.4     89.4
       SdBS           93.8     93.8     95.6       93.4     93.4     95.6
       SmBS           87.2     87.8     90.2       68.6     69.6     74.8
       DeLasso        87.6     87.6     90.4       90.4     84.2     89.6
       BSDe1          23.0     26.0     34.8       17.8     15.4     26.4
       BSDe2          94.8     95.6     97.4       94.4     95.0     95.6

Note: Proposed: the proposed two-step inference method; OldTwostep: the two-step inference method based on Shi and Qu (2017), which does not construct confidence intervals for identified noise variables; Asym: the method based on the asymptotic theory using the one-step adaptive lasso estimator; MLE: the maximum likelihood estimation method; Perturb: the perturbation method; EstEq: the estimating equation-based method; SdBS: the standard bootstrap method; SmBS: the smoothed bootstrap method; DeLasso: the de-biased lasso method; BSDe1: the type-I bootstrap de-biased lasso method; BSDe2: the type-II bootstrap de-biased lasso method.


Table S2: The coverage probabilities (%) of the 95% confidence intervals when the sample size is n = 550.

                            p = 25                        p = 35
θ      Method        ρ = 0   ρ = 0.2  ρ = 0.5     ρ = 0   ρ = 0.2  ρ = 0.5
0      Proposed       95.4     94.8     95.4       94.6     94.2     95.8
       OldTwostep     81.7     77.6     80.0       75.9     76.4     78.9
       Asym            4.2      7.6      7.2        1.4      4.2      7.1
       MLE            95.4     94.8     95.4       94.6     94.2     95.8
       Perturb        99.8    100.0    100.0      100.0    100.0    100.0
       EstEq          95.6     93.8     95.6       95.2     95.0     96.8
       SdBS           99.8     99.6     99.6      100.0    100.0    100.0
       SmBS           99.8    100.0    100.0      100.0    100.0    100.0
       DeLasso        96.6     95.4     97.0       96.4     95.8     97.4
       BSDe1          99.8    100.0     99.8      100.0    100.0    100.0
       BSDe2          95.4     94.6     95.6       95.8     94.2     95.6
0.25   Proposed       94.4     95.6     95.0       95.4     93.8     95.6
       OldTwostep     95.8     96.6     94.8       97.0     95.1     94.7
       Asym           69.4     63.8     68.2       72.3     69.9     68.5
       MLE            94.4     95.6     94.4       93.8     92.0     95.2
       Perturb        57.4     52.8     56.2       54.8     55.2     54.6
       EstEq          93.6     95.0     93.8       93.4     91.4     94.8
       SdBS           68.8     65.2     62.8       65.0     66.8     62.0
       SmBS           67.8     66.0     63.6       61.6     62.8     64.4
       DeLasso        93.0     94.8     94.4       94.0     93.0     95.8
       BSDe1          52.8     57.2     79.2       49.2     57.4     79.6
       BSDe2          94.2     96.4     94.8       95.2     96.0     96.0
0.8    Proposed       94.2     94.4     93.8       95.0     95.0     92.2
       OldTwostep     94.2     94.4     93.8       95.0     95.0     92.2
       Asym           94.2     94.4     90.6       95.0     95.0     89.0
       MLE            93.6     94.4     92.6       90.4     89.4     91.2
       Perturb        90.2     93.0     97.0       93.8     94.2     95.8
       EstEq          92.4     93.0     90.6       90.4     92.4     91.6
       SdBS           91.2     93.8     96.2       91.8     91.6     94.2
       SmBS           88.4     93.8     94.4       87.0     86.0     91.2
       DeLasso        87.0     90.2     89.4       89.0     87.2     90.4
       BSDe1          23.0     26.0     41.2       15.8     18.8     33.8
       BSDe2          96.4     97.2     94.4       93.8     95.8     95.2

Note: Proposed: the proposed two-step inference method; OldTwostep: the two-step inference method based on Shi and Qu (2017), which does not construct confidence intervals for identified noise variables; Asym: the method based on the asymptotic theory using the one-step adaptive lasso estimator; MLE: the maximum likelihood estimation method; Perturb: the perturbation method; EstEq: the estimating equation-based method; SdBS: the standard bootstrap method; SmBS: the smoothed bootstrap method; DeLasso: the de-biased lasso method; BSDe1: the type-I bootstrap de-biased lasso method; BSDe2: the type-II bootstrap de-biased lasso method.


Table S3: The widths (×100) of the 95% confidence intervals when the sample size is n = 350.

                            p = 25                        p = 35
θ      Method        ρ = 0   ρ = 0.2  ρ = 0.5     ρ = 0   ρ = 0.2  ρ = 0.5
0      Proposed       55.7     60.0     78.4       58.2     62.7     82.1
       OldTwostep     55.9     60.8     79.7       58.6     63.2     82.8
       Asym           19.6     21.6     22.3       19.7     18.9     23.3
       MLE            55.7     60.0     78.4       58.2     62.8     82.1
       Perturb        14.5     14.7     17.6       10.3     11.1     13.9
       EstEq          50.4     53.9     70.1       51.1     54.7     71.0
       SdBS           22.9     23.6     27.9       17.2     17.9     21.7
       SmBS           16.6     16.8     19.4       11.4     11.9     14.3
       DeLasso        48.7     51.9     66.8       49.4     52.6     67.5
       BSDe1          49.6     52.8     67.7       50.6     54.0     68.9
       BSDe2          58.7     63.2     82.8       63.6     69.0     90.6
0.3    Proposed       56.2     60.5     79.5       58.6     63.1     83.8
       OldTwostep     56.2     60.6     79.1       58.6     63.0     83.9
       Asym           33.5     34.0     35.0       30.2     32.8     35.8
       MLE            57.0     61.6     80.7       59.5     64.5     84.9
       Perturb        49.6     51.7     55.9       40.5     47.1     50.3
       EstEq          51.0     54.8     71.6       51.6     55.4     72.5
       SdBS           51.6     53.4     58.4       41.1     45.8     49.5
       SmBS           46.0     47.4     50.1       34.6     39.1     40.8
       DeLasso        49.4     52.9     68.3       49.7     53.2     68.6
       BSDe1          51.2     54.9     70.3       52.7     56.4     72.5
       BSDe2          62.8     67.6     88.0       68.8     74.8     98.5
0.95   Proposed       60.9     63.9     73.4       62.0     64.9     75.1
       OldTwostep     60.9     63.9     73.3       62.0     64.9     75.1
       Asym           60.9     63.8     71.0       62.0     64.8     71.8
       MLE            68.6     73.7     93.7       72.9     78.1    100.5
       Perturb        67.4     70.4     91.6       71.1     76.1    103.2
       EstEq          57.4     61.6     78.9       57.9     61.6     79.3
       SdBS           67.6     70.4     87.4       67.2     70.4     86.4
       SmBS           60.8     63.6     79.7       57.9     61.0     75.9
       DeLasso        53.5     56.8     72.9       53.6     57.0     73.2
       BSDe1          56.0     60.2     77.7       58.0     61.8     80.6
       BSDe2          84.5     91.8    115.8      100.8    108.1    137.6

Note: Proposed: the proposed two-step inference method; OldTwostep: the two-step inference method based on Shi and Qu (2017), which does not construct confidence intervals for identified noise variables; Asym: the method based on the asymptotic theory using the one-step adaptive lasso estimator; MLE: the maximum likelihood estimation method; Perturb: the perturbation method; EstEq: the estimating equation-based method; SdBS: the standard bootstrap method; SmBS: the smoothed bootstrap method; DeLasso: the de-biased lasso method; BSDe1: the type-I bootstrap de-biased lasso method; BSDe2: the type-II bootstrap de-biased lasso method.


Table S4: The widths (×100) of the 95% confidence intervals when the sample size is n = 550.

                            p = 25                        p = 35
θ      Method        ρ = 0   ρ = 0.2  ρ = 0.5     ρ = 0   ρ = 0.2  ρ = 0.5
0      Proposed       42.7     45.9     59.9       43.7     47.0     61.5
       OldTwostep     42.8     46.2     60.3       44.0     47.2     61.6
       Asym           14.8     15.3     17.0       13.7     14.7     17.0
       MLE            42.7     45.9     59.9       43.7     47.0     61.5
       Perturb        12.6     13.3     17.1        9.7     10.8     12.7
       EstEq          39.8     42.7     55.5       40.1     42.8     55.8
       SdBS           19.4     19.9     25.2       16.0     17.3     20.3
       SmBS           15.1     15.3     19.4       11.8     12.9     14.7
       DeLasso        38.6     41.2     53.2       38.8     41.4     53.5
       BSDe1          38.8     41.4     53.1       39.1     41.6     53.6
       BSDe2          43.0     46.1     60.3       44.6     48.0     62.9
0.25   Proposed       42.7     46.2     60.6       43.7     47.2     62.3
       OldTwostep     42.7     46.2     60.6       43.7     47.2     62.0
       Asym           25.7     25.2     28.9       25.8     26.4     27.5
       MLE            43.4     46.7     61.2       44.5     48.0     62.9
       Perturb        40.7     41.7     47.7       39.1     41.2     46.2
       EstEq          40.2     43.1     56.3       40.4     43.3     56.7
       SdBS           42.4     43.8     49.8       40.0     41.7     47.8
       SmBS           40.2     41.4     46.0       37.3     39.0     43.6
       DeLasso        39.0     41.7     54.0       39.2     41.7     54.2
       BSDe1          39.9     42.7     54.8       40.4     43.5     55.7
       BSDe2          45.1     48.3     62.8       47.3     51.0     66.5
0.8    Proposed       45.5     47.8     54.9       46.1     48.1     54.8
       OldTwostep     45.5     47.8     54.9       46.1     48.1     54.8
       Asym           45.5     47.8     53.6       46.1     48.1     53.6
       MLE            49.4     53.1     68.0       51.1     54.7     70.2
       Perturb        50.5     53.3     69.3       51.5     53.5     70.2
       EstEq          43.9     47.1     60.8       44.2     47.2     60.9
       SdBS           49.3     52.0     66.2       48.9     50.9     64.4
       SmBS           48.9     51.6     65.8       47.3     49.6     63.2
       DeLasso        41.4     44.2     56.8       41.6     43.9     57.2
       BSDe1          42.9     45.8     59.2       43.3     46.7     60.4
       BSDe2          54.6     58.6     74.9       59.0     63.2     81.5

Note: Proposed: the proposed two-step inference method; OldTwostep: the two-step inference method based on Shi and Qu (2017), which does not construct confidence intervals for identified noise variables; Asym: the method based on the asymptotic theory using the one-step adaptive lasso estimator; MLE: the maximum likelihood estimation method; Perturb: the perturbation method; EstEq: the estimating equation-based method; SdBS: the standard bootstrap method; SmBS: the smoothed bootstrap method; DeLasso: the de-biased lasso method; BSDe1: the type-I bootstrap de-biased lasso method; BSDe2: the type-II bootstrap de-biased lasso method.


[Figure S5 near here: three panels plotting $P(4\in S^{(S)})$, $P(4\in S^{(W)})$ and $P(4\in S^{(N)})$ against $\theta$ for $\delta_1=0.96,0.97,0.98,0.99$.]

Figure S5: Empirical probabilities of assigning the covariate $X_4$ to different signal categories when $(n,p,\rho)=(350,25,0)$, $\tau=0.1$ and the threshold value $\delta_1$ varies.


[Figure S6 near here: coverage probability against $\theta$ for $\delta_1=0.96,0.97,0.98,0.99$.]

Figure S6: Coverage probabilities of the 95% confidence intervals for the proposed two-step inference method when $(n,p,\rho)=(350,25,0)$, $\tau=0.1$ and the threshold value $\delta_1$ varies.


[Figure S7 near here: average width against $\theta$ for $\delta_1=0.96,0.97,0.98,0.99$.]

Figure S7: Average widths of the 95% confidence intervals for the proposed two-step inference method when $(n,p,\rho)=(350,25,0)$, $\tau=0.1$ and the threshold value $\delta_1$ varies.


[Figure S8 near here: three panels plotting $P(4\in S^{(S)})$, $P(4\in S^{(W)})$ and $P(4\in S^{(N)})$ against $\theta$ for $\tau=0.05,0.1,0.15,0.2$.]

Figure S8: Empirical probabilities of assigning the covariate $X_4$ to different signal categories when $(n,p,\rho)=(350,25,0)$, $\delta_1=0.99$ and the threshold value $\tau$ varies.


[Figure S9 near here: coverage probability against $\theta$ for $\tau=0.05,0.1,0.15,0.2$.]

Figure S9: Coverage probabilities of the 95% confidence intervals for the proposed two-step inference method when $(n,p,\rho)=(350,25,0)$, $\delta_1=0.99$ and the threshold value $\tau$ varies.


[Figure S10 near here: average width against $\theta$ for $\tau=0.05,0.1,0.15,0.2$.]

Figure S10: Average widths of the 95% confidence intervals for the proposed two-step inference method when $(n,p,\rho)=(350,25,0)$, $\delta_1=0.99$ and the threshold value $\tau$ varies.


[Figure S11 near here: three panels plotting $P(4\in S^{(S)})$, $P(4\in S^{(W)})$ and $P(4\in S^{(N)})$ against $\theta$ for $q=0,1,2,3$ weak signals.]

Figure S11: Empirical probabilities of assigning the covariate $X_4$ to different signal categories when $(n,p,\rho)=(350,25,0)$, $\delta_1=0.99$, $\tau=0.1$ and the total number of weak signals varies.


[Figure S12 near here: coverage probability against $\theta$ for $q=0,1,2,3$ weak signals.]

Figure S12: Coverage probabilities of the 95% confidence intervals for the proposed two-step inference method when $(n,p,\rho)=(350,25,0)$, $\delta_1=0.99$, $\tau=0.1$ and the total number of weak signals varies.


[Figure S13 near here: average width against $\theta$ for $q=0,1,2,3$ weak signals.]

Figure S13: Average widths of the 95% confidence intervals for the proposed two-step inference method when $(n,p,\rho)=(350,25,0)$, $\delta_1=0.99$, $\tau=0.1$ and the total number of weak signals varies.


Table S5: The candidate predictors used in the real-data analysis

Category                 Predictors
Basic information        year of birth; gender; 3 predictors indicating whether a patient is from California, Texas, New York, or other states
Transcript records       range of BMI; the median of weights; the median of heights; the median of systolic blood pressures; the median of diastolic blood pressures; the median of respiratory rates; the median of temperatures; 4 predictors corresponding to the numbers of transcripts for different physician specialties; number of physicians; number of transcripts with blank visit year; number of visits per weighted year
Diagnosis information    69 predictors corresponding to the numbers of times being diagnosed with different diagnoses; number of diagnoses per weighted year; number of different 3-digit diagnostic groups in the icd9 table; number of different 3-digit diagnostic groups with medication
Medication information   23 predictors indicating the dose of active principle; number of prescriptions or the use of different medications; number of medications without prescription; number of active principles
Lab result               1 binary variable indicating whether a patient has any lab test or not
Smoking status           1 binary variable indicating whether a patient smoked in the past
