The Lasso with Nearly Orthogonal Latin Hypercube
Designs
(February 19, 2013)
Xinwei Deng
Department of Statistics
Virginia Tech
C. Devon Lin
Department of Mathematics and Statistics
Queen’s University, Canada
Peter Z. G. Qian
Department of Statistics
University of Wisconsin-Madison
Abstract
We consider the Lasso problem when the input values need to take multiple lev-
els. In this situation, we propose to use nearly orthogonal Latin hypercube designs,
originally motivated by computer experiments, to significantly enhance the variable
selection accuracy of the Lasso. The use of such designs ensures small column-wise
correlations in variable selection and gives flexibility in identifying nonlinear effects
in the model. Systematic methods for constructing such designs are presented. The
effectiveness of the proposed method is illustrated by several numerical examples.
Keywords: Design of experiments; Nearly orthogonal design; Orthogonal array; Space-
filling design; Variable selection.
1 INTRODUCTION
Experiments with a large number of predictor variables, called factors in experimental
design, are now widely used. In such experiments, often only a few of the predictor variables
have a significant impact on the response. Identifying those significant factors is known as
factor screening in the experimental design literature (Box, Hunter, and Hunter 2005; Yuan,
Joseph, and Lin, 2007). For example, Lin and Sitter (2008) reported an experiment from
Los Alamos National Laboratory where only 5-7 out of 53 predictor variables are important.
Factor screening is a variable selection problem with the advantage that the input values can
be chosen. In the variable selection literature, the Lasso (l1-penalized regression) is a very
popular method (Tibshirani, 1996). This method can be described as follows. Consider the
linear model
y = ∑_{j=1}^{p} xjβj + ε,  (1)
where x = (x1, . . . , xp) is the vector of p predictor variables. Here y is the response variable,
β = (β1, . . . , βp) is the vector of regression parameters, and ε is a normal random variable
with mean zero and variance σ2. Suppose the model in (1) has a sparse structure in that
only p0 entries of β are non-zero and the active set
A(β) = {j : βj ≠ 0, j = 1, . . . , p}  (2)
has p0 elements. Assume, throughout, the response y is centered so that the model in (1)
does not include an intercept. For a given design matrix X containing n rows, x1, . . . ,xn,
in p predictor variables and a given response vector y = (y1, . . . , yn), the Lasso estimate of
β for the model in (1) is
β̂ = arg min_β [ (y − Xβ)T(y − Xβ) + λ‖β‖1 ],  (3)
where λ ≥ 0 is a tuning parameter and ‖β‖1 = ∑_{i=1}^{p} |βi| is the l1 norm of β.
Because the l1 norm is singular at the origin, some entries of β̂ can be exactly zero given
a properly chosen λ value. A larger value of λ will result in more zero entries in β̂. The
non-zero entries of β̂ in (3) are used for the recovery of the sparse structure of the model in
(1). Therefore, the active set A(β) in (2) can be estimated by
A(β̂) = {j : β̂j ≠ 0, j = 1, . . . , p}.
The Lasso solution β̂ in (3) might not be perfect in identifying the true active variables. A
false selection of β̂ can be a false positive or a false negative, where a false positive occurs if
j ∈ A(β̂) but j /∈ A(β) and a false negative takes place if j /∈ A(β̂) but j ∈ A(β). Denote
the number of false selections by γ, i.e.
γ = ∑_j I{j ∈ A(β̂) but j /∈ A(β)} + ∑_j I{j /∈ A(β̂) but j ∈ A(β)},  (4)
where I(·) is an indicator function.
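As a toy numerical check (the coefficient values are made up for illustration, not taken from the paper), the active sets in (2) and the false-selection count γ in (4) can be computed directly:

```python
import numpy as np

beta = np.array([1.5, 0.0, -2.0, 0.0, 0.8])      # true coefficients (toy values)
beta_hat = np.array([1.2, 0.3, 0.0, 0.0, 0.6])   # a hypothetical Lasso estimate

A_true = set(np.flatnonzero(beta))      # active set A(beta) in (2)
A_hat = set(np.flatnonzero(beta_hat))   # estimated active set A(beta-hat)

# gamma in (4): false positives plus false negatives
gamma = len(A_hat - A_true) + len(A_true - A_hat)
print(gamma)  # 2: one false positive (index 1) and one false negative (index 2)
```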
Statistical properties, computation and extensions of the Lasso method have been actively
studied in recent years (Tibshirani, 2011). In this article, we study the Lasso from an
experimental design perspective. Both two-level designs and multi-level designs are used
widely in practice (Wu and Hamada, 2009) for different situations depending on the specific
need. In some examples, predictors like two choices of cooling materials are limited to two
levels because of the nature or restriction of the experiment. In other examples, predictors
like pressure and temperature are continuous in nature and need to take multiple levels. It
is hard to directly compare multi-level designs and two-level designs for a given problem.
There is some optimality theory associated with two-level designs (Fedorov and Hackl,
1997). It is not clear whether these results apply to the Lasso problem, especially for the
p > n situation.
Here we consider the Lasso under a multi-level design, which is an n × p matrix X in (3).
This is the situation in which each predictor variable needs to take multiple levels. Compared
with a two-level design with the points on the boundary, a multi-level design places points
in the interior and on the boundary of the design space. Multi-level designs can detect some
nonlinear effects in the data (Box and Draper, 1959) and allow certain model inadequacy.
Our motivation is that the structure of X can greatly affect the number of false selections γ in
(4). Obviously, if the columns of X are generated with a dependent structure, it can result
in large pairwise correlations. That is, the collinearity among predictor variables causes
inaccurate variable selection, which results in a large value of γ. Therefore, it is important
to reduce the linear dependency among columns when generating the design matrix X. A
natural alternative is to generate these columns independently by simple random sampling.
Although the columns of X generated by this scheme are
stochastically independent, the largest pairwise sample correlations are not guaranteed to
be small especially when the number of predictor variables p is larger than the sample size
n. For illustration, consider a 64× 192 matrix generated from U [0, 1) by using the random
sampling method. Figure 1 depicts the histogram of the sample correlations of the columns
of X, where, in absolute values, 43% of these correlations are larger than 0.1.
Figure 1. Histogram of the column-wise correlations of a 64 × 192 design generated by
random sampling from U[0, 1).
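The reported 43% is easy to check empirically; the sketch below is a rough Monte Carlo check of one such random matrix (the exact fraction varies with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (64, 192))          # a 64 x 192 random design from U[0, 1)

R = np.corrcoef(X, rowvar=False)              # 192 x 192 column correlation matrix
upper = np.abs(R[np.triu_indices(192, k=1)])  # the p(p-1)/2 distinct column pairs
frac = np.mean(upper > 0.1)                   # fraction with |rho_ij| > 0.1
print(round(frac, 2))                         # close to the 43% reported above
```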
To improve upon the simple random sampling method of generating the design matrix
X in (3), we propose to take X to be a nearly orthogonal Latin hypercube design (NOLHD),
originally motivated by computer experiments (Sacks et al., 1989) and used here in the
new application of the Lasso. Such a design is a Latin hypercube design with guaranteed
small column-wise sample correlations. An NOLHD (i) is space-filling in terms of one-
dimensional stratification, and (ii) pursues low correlations among the columns. Because of
the first property, the least squares estimate of an additive model under a Latin hypercube
design has smaller variance than that with a random sample (Owen 1992, Section 3). The
Lasso solution β̂ in (3) inherits this advantage since the model in (1) is additive. Because
of the second property, the Lasso with an NOLHD performs better than its counterpart
with a random Latin hypercube design that can have large column-wise sample correlations
(McKay, Beckman, and Conover, 1979). Furthermore, the sparsity of the model in (1) and
the concept of an NOLHD are seamlessly connected: when X in (3) is an NOLHD in p factors,
its projection onto any p0 active factors gives a smaller NOLHD in p0 factors.
The remainder of the article is organized as follows. Section 2 presents two systematic
methods of constructing NOLHDs. The first method is taken from Lin, Mukerjee, and Tang
(2009). The second method is a generalization of the method in Lin, Bingham, Sitter, and
Tang (2010). A new criterion to measure the near orthogonality of an NOLHD for the
Lasso is also introduced in this section. Section 3 provides numerical examples to bear out
the advantage of NOLHDs for the Lasso over random Latin hypercube designs and random
samples. We provide some discussion in Section 4, including the comparison of several
popular optimal two-level designs.
2 METHODOLOGY
In this section, we discuss the construction of NOLHDs and define a new criterion to measure
the orthogonality of NOLHDs for the Lasso problem. We first introduce some notation and definitions.
Let [ ] denote juxtaposing matrices of the same number of rows column by column. The
Kronecker product of an n× p matrix A = (aij) and an m× q matrix B = (bij) is
A ⊗ B =
[ a11B  a12B  · · ·  a1pB ]
[ a21B  a22B  · · ·  a2pB ]
[  ...    ...   · · ·   ...  ]
[ an1B  an2B  · · ·  anpB ],

where aijB is an m × q matrix with (k, l) entry aijbkl. The (i, j)th entry of the p × p
correlation matrix ρ of an n × p matrix X = (xij) is

ρij = ∑_{k=1}^{n} (xki − x̄i)(xkj − x̄j) / {∑_k (xki − x̄i)² ∑_k (xkj − x̄j)²}^{1/2},  (5)

which represents the Pearson correlation between columns i and j, with x̄i = n⁻¹∑_{k=1}^{n} xki
and x̄j = n⁻¹∑_{k=1}^{n} xkj. If ρ is an identity matrix, X is orthogonal.
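Both definitions are easy to verify numerically; a minimal sketch with small illustrative matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

K = np.kron(A, B)                        # block (i, j) of A (x) B is a_ij * B
assert K.shape == (4, 4)
assert np.array_equal(K[:2, 2:], 2 * B)  # the (1, 2) block is a_12 * B = 2B

X = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 1.0],
              [3.0, 3.0, 0.0],
              [4.0, 4.0, 1.0]])
rho = np.corrcoef(X, rowvar=False)       # the correlation matrix in (5)
assert np.allclose(np.diag(rho), 1.0)    # each column correlates perfectly with itself
```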
Construction of a random Latin hypercube design (RLHD) Z = (zij) on [0, 1)p of n
runs for p factors consists of (i) generating a matrix D = (dij) with columns being random
permutations of 1, . . . , n, and (ii) using
zij = (dij − uij)/n,  i = 1, . . . , n, j = 1, . . . , p,  (6)
where the uij are independent U[0, 1) random variables, and the dij and the uij are mutually
independent. To define Z on [a, b)p for general a < b, rescale zij in (6) to
(b− a)zij + a. (7)
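Steps (i)-(ii) and the rescaling in (7) translate directly into code; a minimal sketch of an RLHD generator:

```python
import numpy as np

def random_lhd(n, p, a=0.0, b=1.0, seed=None):
    """Random Latin hypercube on [a, b)^p via (6)-(7)."""
    rng = np.random.default_rng(seed)
    # (i) columns of D are random permutations of 1, ..., n
    D = np.column_stack([rng.permutation(np.arange(1, n + 1)) for _ in range(p)])
    U = rng.uniform(0.0, 1.0, (n, p))    # independent U[0, 1) jitter
    Z = (D - U) / n                      # z_ij = (d_ij - u_ij)/n, eq. (6)
    return (b - a) * Z + a               # rescale to [a, b), eq. (7)

Z = random_lhd(8, 3, seed=0)
# One-dimensional stratification: each column puts exactly one point in each
# of the n intervals ((k-1)/n, k/n], k = 1, ..., n.
for col in Z.T:
    assert sorted(np.ceil(col * 8).astype(int)) == list(range(1, 9))
```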
An NOLHD is a Latin hypercube design with small column-wise correlations. Though
for some values of n and p such designs with exactly orthogonal columns are available,
they exist more generally under the nearly orthogonal condition, even for p ≥ n. Popular
criteria for defining the (near) orthogonality of a matrix include the maximum correlation
ρm = max_{i<j} |ρij| and the root average squared correlation ρave = {∑_{i<j} ρij² / [p(p − 1)/2]}^{1/2}
(Bingham, Sitter, and Tang, 2009), where ρij is given in (5). For the Lasso problem here,
we introduce a new criterion, called the correlation percentage vector. For an integer q and
a vector t = (t1, . . . , tq) with 1 ≥ t1 > t2 > · · · > tq ≥ 0, the correlation percentage vector of
a design X is
δ(X) = (δt1 , . . . , δtq),  (8)
where δtk = {p(p − 1)/2}⁻¹ ∑_{i=1}^{p} ∑_{j>i} I(|ρij| ≤ tk) for k = 1, . . . , q, and I(·) is an indicator
function. Elements of the vector t are generally chosen to be relatively small, say between
0.005 and 0.1, to reflect the near orthogonality of a design. For k = 1, . . . , q, δtk
computes the proportion of the |ρij| not exceeding tk. For two designs X1 and X2 of the
same size, X1 is preferred over X2 if δtk(X1) > δtk(X2) for k = 1, . . . , q. This criterion is
more discriminating than the maximum correlation criterion and the root average squared
correlation criterion. Designs with similar ρm or ρave values can have different δ values.
For example, a 64 × 192 matrix X1 generated by the simple random sampling method and
an NOLHD X2 of the same size constructed in Section 2.1 have ρave = 0.124 and 0.112,
respectively, and thus are indistinguishable. But for (t1, t2, t3, t4) = (0.1, 0.05, 0.01, 0.005),
δ(X1) = (0.562, 0.305, 0.064, 0.033) and δ(X2) = (0.906, 0.894, 0.883, 0.883), indicating the
superiority of X2. Examples in Section 3 will demonstrate that the Lasso under designs with
larger correlation percentage vectors appears to give more accurate variable selection.
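The correlation percentage vector in (8) can be computed as follows; the design here is a random matrix for illustration, so its δ values differ from the NOLHD values quoted above:

```python
import numpy as np

def corr_percentage_vector(X, t=(0.1, 0.05, 0.01, 0.005)):
    """delta(X) in (8): for each threshold t_k, the proportion of |rho_ij| <= t_k."""
    p = X.shape[1]
    rho = np.corrcoef(X, rowvar=False)
    abs_upper = np.abs(rho[np.triu_indices(p, k=1)])   # the p(p-1)/2 pairs
    return np.array([np.mean(abs_upper <= tk) for tk in t])

rng = np.random.default_rng(0)
delta = corr_percentage_vector(rng.uniform(0, 1, (64, 192)))
# delta is nonincreasing in k because the thresholds t_k decrease
assert all(delta[k] >= delta[k + 1] for k in range(3))
```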
Sections 2.1 and 2.2 present two easy-to-implement methods for constructing NOLHDs.
The first is taken from Lin, Mukerjee, and Tang (2009) and the second is a generalization
of the method in Lin et al. (2010). To assist readers unfamiliar with experimental design,
we describe these methods in a self-contained fashion. Other construction methods for
NOLHDs can be found in Ye (1998), Steinberg and Lin (2006), Pang, Liu, and Lin (2009),
and Sun, Liu, and Lin (2009, 2010), among others. Hereafter, let NOLHD(n, p) denote an
NOLHD with n rows and p columns, where the n levels in each column are taken to be
{−(n − 1)/2, −(n − 3)/2, . . . , (n − 3)/2, (n − 1)/2}.
2.1 A Construction Method Using Orthogonal Arrays
Lin et al. (2009) proposed a method for constructing nearly orthogonal Latin hypercubes
using orthogonal arrays. An orthogonal array OA(n, p, s) of strength two is an n× p matrix
with levels 1, . . . , s such that, for any two columns, all level combinations appear equally
often (Hedayat, Sloane, and Stufken, 1999). Let A be an OA(s2, 2f, s) and let B = (bij) be
an s× p Latin hypercube. This method consists of three steps.
Step 1. For j = 1, . . . , p, obtain an s2 × (2f) matrix Aj from A by replacing the symbols
1, . . . , s in A by b1j, . . . , bsj, respectively, and partition Aj into Aj1, . . . , Ajf, each of two
columns.
Step 2. Let
V =
[ 1  −s ]
[ s   1 ].
For j = 1, . . . , p, obtain an s2 × (2f) matrix
Mj = [Aj1V, . . . , AjfV].
Step 3. For n = s2 and q = 2pf , define an n× q matrix
M = [M1, . . . ,Mp]. (9)
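Steps 1-3 can be sketched as follows, using s = 3 with a standard OA(9, 4, 3) built over GF(3) and a toy 3 × 2 Latin hypercube B; both inputs are illustrative choices, not designs from the paper:

```python
import numpy as np

# An OA(9, 4, 3) of strength two over GF(3): rows (a, b, a+b, a+2b), symbols 1..3.
ab = np.array([(a, b) for a in range(3) for b in range(3)])
A = np.column_stack([ab[:, 0], ab[:, 1],
                     (ab[:, 0] + ab[:, 1]) % 3,
                     (ab[:, 0] + 2 * ab[:, 1]) % 3]) + 1

s, f, p = 3, 2, 2
B = np.array([[-1, 0], [0, 1], [1, -1]])   # toy 3 x 2 Latin hypercube, levels -1, 0, 1
V = np.array([[1, -s], [s, 1]])

blocks = []
for j in range(p):
    Aj = B[A - 1, j]                       # Step 1: replace symbol i by b_ij ...
    for k in range(f):                     # ... and partition into s^2 x 2 pieces
        blocks.append(Aj[:, 2 * k:2 * k + 2] @ V)   # Step 2: multiply each piece by V
M = np.hstack(blocks)                      # Step 3: M is 9 x (2pf) = 9 x 8

# Each column of M is a permutation of the nine levels -4, ..., 4, so M is a
# 9 x 8 Latin hypercube, as Lemma 1(a) asserts.
for col in M.T:
    assert sorted(col) == list(range(-4, 5))
```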
Lemma 1 from Lin et al. (2009) captures the structure of M in (9).
Lemma 1. (a) The matrix M is an s2 × (2pf) Latin hypercube. (b) The correlation
matrix of M is ρ(B)⊗ I2f , where I2f is the identity matrix of order 2f .
Remark 1. For M constructed above, the correlation percentage vector δ in (8) has
δtk = [p(2f − 1) + (p− 1)δtk(B)]/(2pf − 1), for k = 1, . . . , q. (10)
Example 1. Let A be an OA(49, 8, 7) from Hedayat et al. (1999) and let B be an
NOLHD(7, 12) given by
−3 0 −1 0 3 3 0 −2 1 −3 −1 −3
−2 −1 1 −3 −1 −3 1 −3 −2 −1 1 3
−1 3 0 3 0 −2 2 0 −1 3 −3 −1
0 −2 3 2 −2 2 −2 1 −3 1 2 −2
1 1 −3 −1 −3 1 −3 −1 3 2 0 1
2 −3 −2 1 1 −1 3 2 2 0 3 0
3 2 2 −2 2 0 −1 3 0 −2 −2 2
,
where ρave = 0.3038, ρm = 0.9643, and δ = (0.500, 0.364, 0.136, 0.136) for t = (0.1, 0.05, 0.01,
0.005). Here, M from Lemma 1 is an NOLHD(49, 96) with ρave = 0.1034 and ρm = 0.9643.
From (10), M has δ = (0.942, 0.926, 0.9, 0.9) for t = (0.1, 0.05, 0.01, 0.005). More generally,
if B is an NOLHD(7, p) for an integer p ≥ 1, Lemma 1 gives an NOLHD(49, 8p).
2.2 A Construction Method Using the Kronecker Product
We propose a generalization of the method in Lin et al. (2010) for constructing NOLHDs
to provide designs with better low-dimensional projection properties than those from the
original method. For j = 1, . . . , m2, let Cj = (c^j_{ik}) be an n1 × m1 Latin hypercube and
let Aj = (a^j_{ik}) be an n1 × m1 matrix with entries ±1. Let B = (bij) be an n2 × m2 Latin
hypercube, let D = (dij) be an n2 × m2 matrix with entries ±1, and let r be a real number.
The proposed method constructs
M =
[ b11A1 + rd11C1      b12A2 + rd12C2      · · ·   b1m2Am2 + rd1m2Cm2 ]
[ b21A1 + rd21C1      b22A2 + rd22C2      · · ·   b2m2Am2 + rd2m2Cm2 ]
[        ...                 ...           · · ·          ...         ]
[ bn21A1 + rdn21C1    bn22A2 + rdn22C2    · · ·   bn2m2Am2 + rdn2m2Cm2 ].  (11)
In contrast, the method in Lin et al. (2010) constructs
L = A⊗B + rC⊗D, (12)
where A = (aij) is an n1 × m1 matrix with entries ±1, C = (cij) is an n1 × m1 Latin
hypercube, and B, D and r are as in (11). Lin et al. (2010) provided the conditions for L
to be a nearly orthogonal Latin hypercube. The points of the design in (12) lie on straight
lines when projected onto some factor pairs, which is undesirable for the Lasso problem.
This drawback is due to the use of the same matrix A and the same matrix C for each entry
of B and D in (12). The generalization in (11) eliminates this undesirable pattern by using
different matrices A1, . . . ,Am2 and different matrices C1, . . . ,Cm2 . Proposition 1 establishes
conditions for M in (11) to be a Latin hypercube.
Proposition 1. Let r = n2. Then M in (11) is a Latin hypercube if one of the following
conditions holds:
(a) for j = 1, . . . , m2, the rows of Aj and Cj can be arranged in pairs (p, p′) such that, for
i = 1, . . . , m1, c^j_{pi} = −c^j_{p′i} and a^j_{pi} = a^j_{p′i};
(b) for k = 1, . . . , m2, the rows of B and D can be arranged in pairs (q, q′) such that
bqk = −bq′k and dqk = dq′k.
Proposition 1 can be verified by using an argument similar to the proof of Lemma 1 in
Lin et al. (2010) and thus is omitted. Proposition 2 studies the orthogonality of M in terms
of the structure of the Aj, B, the Cj and D.
Proposition 2. Suppose the Aj, B, the Cj, D and r in (11) satisfy condition (a) or (b) in
Proposition 1 and M in (11) is a Latin hypercube. In addition, assume that the Aj, B and
D are orthogonal, and that BTD = 0 or Aj^TCj = 0 holds for all j. Then for M, we have
that
(a) ρm(M) = max{w1 ρm(Cj), j = 1, . . . , m2}, where w1 = n2²(n1² − 1)/(n1²n2² − 1);
(b) ρave(M) = {w2 ∑_{j=1}^{m2} ρave²(Cj)/m2}^{1/2}, where w2 = (m1 − 1)w1²/(m1m2 − 1);
(c) for t = (t1, . . . , tq) with entries between 0 and 1, δtk(M) ≥ ∑_{j=1}^{m2} δtk(Cj)/m2 for
k = 1, . . . , q;
(d) the matrix M is orthogonal if and only if C1, . . . , Cm2 are all orthogonal.
Proof. Let mjk and mj′k′ be columns (j − 1)m1 + k and (j′ − 1)m1 + k′ of M in (11),
respectively. Take n = n1n2. Let ρ(mjk, mj′k′) be the correlation between mjk and mj′k′
defined in (5). Express 12⁻¹n(n² − 1)ρ(mjk, mj′k′) as

∑_{i1=1}^{n2} ∑_{i2=1}^{n1} (b_{i1j}a^j_{i2k} + n2 d_{i1j}c^j_{i2k})(b_{i1j′}a^{j′}_{i2k′} + n2 d_{i1j′}c^{j′}_{i2k′}),

which equals

∑_{i1} b_{i1j}b_{i1j′} ∑_{i2} a^j_{i2k}a^{j′}_{i2k′} + n2 ∑_{i1} d_{i1j}b_{i1j′} ∑_{i2} c^j_{i2k}a^{j′}_{i2k′}
+ n2 ∑_{i1} b_{i1j}d_{i1j′} ∑_{i2} a^j_{i2k}c^{j′}_{i2k′} + n2² ∑_{i1} d_{i1j}d_{i1j′} ∑_{i2} c^j_{i2k}c^{j′}_{i2k′}
= ∑_{i1} b_{i1j}b_{i1j′} ∑_{i2} a^j_{i2k}a^{j′}_{i2k′} + n2² ∑_{i1} d_{i1j}d_{i1j′} ∑_{i2} c^j_{i2k}c^{j′}_{i2k′}.

Thus, ρ(mjk, mj′k′) is zero for j ≠ j′ and is n2²(n1² − 1)ρkk′(Cj)/(n² − 1) for j = j′ and
k ≠ k′. By the definitions of ρm and ρave, the results in (a) and (b) hold. For k = 1, . . . , q,
note that

δtk(M) = {m2(m2 − 1)m1² + m1(m1 − 1)∑_{j=1}^{m2} δtk(Cj)}/{m1m2(m1m2 − 1)}
= ∑_{j=1}^{m2} δtk(Cj)/m2 + [(m2 − 1)m1{1 − ∑_{j=1}^{m2} δtk(Cj)/m2}]/(m1m2 − 1).

The result in (c) now follows because ∑_{j=1}^{m2} δtk(Cj)/m2 ≤ 1. Then (d) is evident from (a),
(b) and (c). This completes the proof.
Proposition 2 expresses the near orthogonality of M in (11) in terms of that of the
Cj and establishes conditions on the Aj, B and D under which M is an orthogonal
Latin hypercube. The required matrices in Proposition 2 can be chosen as follows. First,
orthogonal matrices useful for obtaining the Aj and D are readily available from Hadamard
matrices when n1 and n2 are multiples of four. Second, orthogonal Latin hypercubes useful
for obtaining B are available from Pang et al. (2009), Lin et al. (2009) and Lin et al. (2010),
among others. If the Aj, B and D are orthogonal, and either BT D = 0 or ATj Cj = 0, then
M is orthogonal when the Cj are orthogonal Latin hypercubes. If the Cj are NOLHDs like
those from Lin et al. (2009) and Lin et al. (2010), then M is nearly orthogonal. If C1 is an
NOLHD, C2, . . . ,Cm2 can be obtained by permuting the rows of C1.
Example 2. Let
B = (1/2) ×
1 −3 7 5
3 1 5 −7
5 −7 −3 −1
7 5 −1 3
−1 3 −7 −5
−3 −1 −5 7
−5 7 3 1
−7 −5 1 −3
,D =
1 1 1 1
1 1 −1 −1
1 −1 1 −1
1 −1 −1 1
−1 1 1 1
−1 1 −1 −1
−1 −1 1 −1
−1 −1 −1 1
,
A1 =
1 1 1 1 1 1
1 −1 1 1 1 −1
−1 1 −1 1 1 1
−1 −1 1 −1 1 1
1 −1 −1 1 −1 1
−1 1 −1 −1 1 −1
−1 −1 1 −1 −1 1
−1 −1 −1 1 −1 −1
1 −1 −1 −1 1 −1
1 1 −1 −1 −1 1
1 1 1 −1 −1 −1
−1 1 1 1 −1 −1
, and C1 = (1/2) ×
−11 −9 9 11 5 1
−9 5 −1 −5 −9 11
−7 11 −3 3 1 −7
−5 −1 −9 −9 −1 −9
−3 −7 5 −11 7 −1
−1 9 −7 5 9 5
1 −3 7 −7 −7 3
3 −11 −11 9 −11 −3
5 7 11 7 −5 −5
7 −5 −5 1 11 7
9 1 3 −3 3 −11
11 3 1 −1 −3 9
.
For j = 2, 3, 4, obtain Aj and Cj by permuting the rows of A1 and C1, respectively. Using
the above matrices, M in (11) is a 96× 24 orthogonal Latin hypercube.
Example 3. Let C1 be an NOLHD(25, 24) constructed by Lemma 1 with an OA(25, 6, 5)
from Hedayat et al. (1999) and an NOLHD(5, 4) given by
−2 2 0 0
−1 −2 1 2
0 −1 −2 −1
1 0 2 −2
2 1 −1 1
.
Permute the rows of C1 to get an NOLHD C2. Let A1 and A2 be two 25 × 24 nearly
orthogonal matrices generated by using the Gendex DOE software based on Nguyen (1996).
Using
B = (1/2) ×
[  1  −1 ]
[ −1   1 ]
and
D =
[ 1  1 ]
[ 1  1 ],
M in (11) is an NOLHD(50, 48).
3 NUMERICAL ILLUSTRATION
In this section, we provide numerical examples to investigate several different types of designs
by comparing the number of false selections γ in (4) for the Lasso. For a design matrix with n
runs and p factors, Method I uses an NOLHD(n, p) from Section 2, Method II uses an RLHD
from (7), and Method III uses a random sampling method. Denote by γNOLH , γRLHD and
γIID the γ values of the false selections of these methods, respectively. The focus here is to
investigate the effect of the design matrix X on γ, and the same ε = (ε1, . . . , εn) is used to
generate the response y = (y1, . . . , yn) in (1). The tuning parameter λ in (3) is selected by
cross-validation with five folds. The package lars (Efron, Hastie, Johnstone, and Tibshirani,
2003) in R (R, 2010) is used to compute the Lasso solution β̂ in (3).
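The paper computes β̂ with the lars package in R and selects λ by five-fold cross-validation. The sketch below is a simplified Python stand-in: it fixes λ at an arbitrary illustrative value, solves (3) by a basic proximal gradient (ISTA) loop, and takes Method III's random design to be uniform over the multi-level design range (all three are assumptions of this sketch, not the paper's procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 48, 8.0
beta = np.concatenate([np.arange(0.8, 3.01, 0.2), np.zeros(36)])  # as in Example 4
active = set(np.flatnonzero(beta))
eps = rng.normal(0.0, sigma, n)            # the same noise is reused across designs

def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal gradient (ISTA) for the criterion in (3)."""
    step = np.linalg.norm(X, 2) ** 2       # squared spectral norm of X
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / step   # gradient step on the quadratic part
        b = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * step), 0.0)
    return b

def gamma(X, lam=3000.0):                  # lam is an illustrative fixed value
    y = X @ beta + eps
    sel = set(np.flatnonzero(lasso_ista(X, y - y.mean(), lam)))
    return len(sel - active) + len(active - sel)   # false selections, eq. (4)

X_iid = rng.uniform(-(n - 1) / 2, (n - 1) / 2, (n, p))  # a Method III design
g_iid = gamma(X_iid)
print(g_iid)
```

Swapping in an NOLHD, RLHD, or random matrix for X while holding eps fixed reproduces the kind of design comparison made in the examples below.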
Example 4. For the model in (1), let p = 48, σ = 8, and β = (0.8, 1.0, . . . , 3, 0, . . . , 0)
with zeros for the last 36 coefficients. For Methods I, II, and III, choose n = 50 with n ≈ p.
Method I takes the NOLHD(50, 48) from Example 3. We compute γNOLH , γRLHD, and γIID
100 times. Table 1 compares three quartiles and Figure 2 depicts the boxplots of the false
selection γ values of these methods, where γNOLH is smaller than γRLHD and γIID.
Table 1. Three quartiles of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 4
γNOLH γRLHD γIID
median 14.00 19.50 18.00
1st quartile 13.00 15.00 14.00
3rd quartile 16.00 24.00 23.00
Figure 2. Boxplots of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 4.
Example 5. For the model in (1), let p = 96, σ = 8, and β = (0.2, 0.4, . . . , 3, 0, . . . , 0)
with the last 81 coefficients being zero. For Methods I, II, and III, choose n = 98 with
n ≈ p. Method I takes the NOLHD(98, 96) obtained using the method in Section 2.2. Table
2 compares three quartiles of γNOLH , γRLHD, and γIID values over the 100 replicates. Figure
3 depicts the boxplots of the false selection γ values of these methods. Once again, γNOLH
is significantly smaller than γRLHD and γIID.
Table 2. Three quartiles of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 5
γNOLH γRLHD γIID
median 22.00 28.50 30.00
1st quartile 19.00 24.00 25.75
3rd quartile 25.00 32.00 35.00
Figure 3. Boxplots of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 5.
3.1 Designs with p > n
Example 6. For the model in (1), let p = 96, σ = 8 and β = (0.2, 0.4, . . . , 3, 0, . . . , 0)
with the last 81 coefficients being zero. For Methods I, II, and III, take n = 49 with p > n.
Method I uses the NOLHD(49, 96) in Section 2.1. Table 3 compares three quartiles of γNOLH,
γRLHD, and γIID values over the 100 replicates. Figure 4 depicts the boxplots of the false
selection γ values of these methods. Here Method I significantly outperforms Method II and
Method III as γNOLH is much smaller than γRLHD and γIID.
Table 3. Three quartiles of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 6
γNOLH γRLHD γIID
median 20.00 25.00 25.00
1st quartile 15.00 21.00 22.00
3rd quartile 25.00 28.25 28.00
Figure 4. Boxplots of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 6.
Example 7. For the model in (1), let p = 192, σ = 8 and β = (0.05, 0.2, . . . , 3, 0, . . . , 0)
with the last 172 coefficients being zero. For Methods I, II, and III, take n = 64 with p > n.
Method I uses an NOLHD(64, 192) constructed by Lemma 1 in Section 2.1. Table 4 compares
three quartiles of γNOLH , γRLHD, and γIID values over the 100 replicates. Figure 5 depicts
the boxplots of the false selection γ values for these methods, where γNOLH is much smaller
than γRLHD and γIID. Using an NOLHD leads to significant improvement of the Lasso for
this small n, large p case.
Table 4. Three quartiles of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 7
γNOLH γRLHD γIID
median 26.50 42.00 40.00
1st quartile 21.00 38.00 32.00
3rd quartile 35.00 46.00 45.25
Figure 5. Boxplots of the false selection values γNOLH, γRLHD, and γIID over the 100
replicates for Example 7.
These examples all suggest that the Lasso solution with an NOLHD gives more accurate
variable selection than those of the two competing designs. Comparing Figures 2–5 indicates
that the advantage of NOLHDs grows as the ratio p/n increases.
3.2 Multi-level Designs for Accommodating Nonlinear Effects
We now illustrate the advantage of the proposed multi-level designs over two-level designs
in variable selection with nonlinear effects. Multi-level designs can identify some nonlinear
effects and thus improve the accuracy of Lasso variable selection. We consider Method I∗,
a modification of Method I that uses the proposed multi-level design for Lasso variable
selection. With the proposed designs, nonlinear effects can be identified by methods like
main-effect plots (Box, Hunter, and Hunter, 2005); the Lasso is then fitted under a linear
model after subtracting those effects from the response. Specifically, Method I∗ first regresses
the response against some predictors that are known to have possible nonlinear effects and
then uses the residuals as the new response to fit the Lasso with the remaining predictors.
In addition to Methods
I-III, we also include Method IV and Method V, which consider designs with the two levels
±(n − 1)/2. Method IV uses a random two-level design for the Lasso; the random two-level
design is generated by randomly assigning each entry one of the two levels. Method V uses
a two-level Bayesian D-optimal supersaturated design (Jones, Lin, and Nachtsheim, 2008)
for the Lasso. Supersaturated designs are designs whose number of factors exceeds the run
size. It should be made clear that no D-optimal supersaturated design under the Lasso
model is currently available, and all such designs are constructed based on various linear
model criteria. Denote by γNOLH∗, γRAND, and γSSD the values of the false selections of
Method I∗, Method IV, and Method V, respectively.
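A minimal sketch of the two-step idea behind Method I∗, with a made-up nonlinear predictor and a quadratic least squares adjustment standing in for the main-effect-plot step (the final Lasso fit on the remaining predictors is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x_lin = rng.uniform(-1, 1, (n, 3))   # predictors entering the model linearly
x_nl = rng.uniform(-1, 1, n)         # a predictor suspected of a nonlinear effect
y = x_lin @ np.array([1.0, 0.5, 0.0]) + 2.0 * x_nl**2 + rng.normal(0, 0.2, n)

# Step 1: regress the response on a flexible function of the suspected
# predictor (here a quadratic polynomial, an illustrative choice).
P = np.column_stack([np.ones(n), x_nl, x_nl**2])
coef, *_ = np.linalg.lstsq(P, y, rcond=None)
resid = y - P @ coef

# Step 2: use the residuals as the new response for the Lasso on the
# remaining predictors (the Lasso step itself is omitted in this sketch).
assert resid.std() < y.std()  # the nonlinear effect has been filtered out
```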
Example 8. Let p = 48, σ = 8, and the sample size n = 50. To accommodate the nonlinear
effects, we generate the response using the following model,
y = 0.8x1 + 1.0x2 + 1.2x3 + 1.4x4 + 1.6x5 + 1.8x46² + 2.0x47³ + 2.2 sin(x48) + ε.  (13)
The nonlinear effects are unknown, and such information cannot be used in the design stage.
We fit the data by the Lasso with the linear model in (1). For Method I, we use an
NOLHD(50, 48) for constructing the design matrix. Method I∗ uses the same design matrix as
in Method I, but identifies the nonlinear effects and subtracts them from the response
before fitting the Lasso. Table 5 compares the three quartiles of γNOLH∗, γNOLH, γRLHD,
γIID, γRAND, and γSSD values over the 100 replicates. Figure 6 depicts the boxplots of the
γ values for the six methods.
Table 5. Three quartiles of the false selection values γNOLH∗, γNOLH, γRLHD, γIID, γRAND,
and γSSD over the 100 replicates for Example 8
γNOLH∗ γNOLH γRLHD γIID γRAND γSSD
median 12.00 15.00 19.00 23.00 29.00 27.00
1st quartile 11.00 13.00 18.00 20.75 28.00 26.00
3rd quartile 13.00 16.00 22.00 25.25 31.00 29.00
Example 9. The setting of this example is the same as that of Example 8, except that p = 96 and the
Figure 6. Boxplots of the false selection values γNOLH∗, γNOLH, γRLHD, γIID, γRAND, and
γSSD over the 100 replicates for Example 8.
sample size n = 49. The response is generated from the following model,
y = ∑_{k=1}^{12} αk xk + 2.6x94² + 2.8x95³ + 3.0 sin(x96) + ε,  (14)
with αk = 0.2k. For Method I and Method I∗, we use an NOLHD(49, 96) for constructing the
design matrix. We compare Method I and Method I∗ with Methods II-V. Table 6 reports
the three quartiles of γNOLH∗, γNOLH , γRLHD, γIID, γRAND, and γSSD values over the 100
replicates. Figure 7 depicts the boxplots of the γ values for the six methods.
Table 6. Three quartiles of the false selection values γNOLH∗, γNOLH, γRLHD, γIID, γRAND,
and γSSD over the 100 replicates for Example 9
γNOLH∗ γNOLH γRLHD γIID γRAND γSSD
median 23.00 26.00 32.00 36.00 41.00 44.00
1st quartile 21.75 24.00 29.00 33.00 38.75 43.00
3rd quartile 25.00 27.00 34.00 38.00 44.00 45.00
Figure 7. Boxplots of the false selection values γNOLH∗, γNOLH, γRLHD, γIID, γRAND, and
γSSD over the 100 replicates for Example 9.
Examples 8-9 indicate that the proposed Method I∗ outperforms Method I. Method I
directly applies the Lasso with the linear model to fit the data containing nonlinear effects.
Taking advantage of multi-level designs to identify the nonlinear effects and filter them out,
Method I∗ improves the variable selection accuracy. We also observed that both Method I
and Method I∗ provide more accurate variable selection than Methods II-V. This indicates
that the proposed design still performs well in variable selection in the presence of model
inadequacy.
4 DISCUSSION
We have proposed to use NOLHDs, originally developed in computer experiments, to signifi-
cantly enhance the accuracy of the Lasso variable selection. The effectiveness of this method
has been illustrated by several examples in Section 3.
In this work, we have considered the situation where the factors need to take multiple
levels. For situations that factors are restricted to two levels, there are optimal two-level
designs based on certain criteria when the model is adequate. However, these existing optimal
two-level designs are not developed for variable selection and might not work well.
To elaborate on this point, we evaluate the accuracy of Lasso variable selection under three
different types of designs with two levels ±(n− 1)/2: Method IV, random two-level designs;
Method V, Bayesian D-optimal supersaturated designs described in Section 3.2; and Method
VI, balanced two-level designs. The Bayesian D-optimal supersaturated designs are gener-
ated by using the statistical software JMP. Popular methods of constructing supersaturated
designs include Lin (1993), Wu (1993), Nguyen (1996), Jones, Lin and Nachtsheim (2008).
A balanced two-level design of n runs for p factors is an n × p matrix in which the two
levels appear equally often in each column. For a given design matrix X, the responses
y = (y1, . . . , yn) are generated from the model in (1). Denote by γSSD, γBLAN, and γRAND
the γ values of the false selections associated with Methods V, VI, and IV, respectively. We
consider three different examples as follows.
Example 10. For the model in (1), let n = 50, p = 48, σ = 8, and β = (0.8, 1.0, . . . , 3, 0, . . . ,
0) with zeros for the last 36 coefficients.
Example 11. For the model in (1), let n = 49, p = 96, σ = 8 and β = (0.2, 0.4, . . . , 3, 0, . . . ,
0) with the last 81 coefficients being zero.
Example 12. For the model in (1), let n = 64, p = 192, σ = 8 and β = (0.05, 0.2, . . . , 3, 0, . . .
, 0) with the last 172 coefficients being zero.
Table 7 compares three quartiles of γSSD, γBLAN , and γRAND values over the 100 replicates
for Examples 10-12, respectively. The results show that the Bayesian D-optimal supersatu-
rated designs (SSD) outperform the other two designs in Examples 11-12 but not in Example
10. This mixed performance may be due to the fact that the D-optimality criterion of such
designs was developed for ordinary least squares, not for Lasso variable selection.
The settings of Examples 10-12 are the same as Examples 4, 6, and 7, respectively, which
compared the proposed multi-level designs. Comparing these six examples, the Lasso under
the proposed multi-level designs generally performs better than it does under the two-level
Bayesian D-optimal supersaturated designs.
Table 7. Three quartiles of the false selection values γSSD, γBLAN, and γRAND over the 100
replicates for Examples 10-12
γSSD γBLAN γRAND
Example 10
median 19.00 19.00 18.00
1st quartile 15.00 15.00 16.00
3rd quartile 24.25 23.00 22.00
Example 11
median 20.00 24.00 25.00
1st quartile 17.00 21.00 22.00
3rd quartile 23.25 27.00 28.00
Example 12
median 27.00 42.00 38.00
1st quartile 24.00 36.00 32.00
3rd quartile 32.25 45.25 43.00
Constructing two-level optimal designs for the Lasso can be an interesting topic for future
research. One possibility is to extend the ideas behind Meyer, Steinberg, and Box (1996)
and Bingham and Chipman (2007). But this new problem poses significant challenges since
the effect of X on the number of false selections γ cannot be expressed in a closed form.
Some progress in this direction is made in Xing et al. (2011).
Acknowledgement
Lin is supported by grants from the Natural Sciences and Engineering Research Council
of Canada. Qian is supported by National Science Foundation Grant DMS-0969616. The
authors wish to thank Hugh Chipman for useful suggestions, and the editor and associate
editor for their constructive comments.
References
Bingham, D., and Chipman, H. A. (2007), Incorporating Prior Information in Optimal
Design for Model Selection, Technometrics, 49, 155–163.
Bingham, D., Sitter, R. R., and Tang, B. (2009), Orthogonal and Nearly Orthogonal Designs
for Computer Experiments, Biometrika, 96, 51–65.
Box, G. E. P., and Draper, N. R. (1959), A Basis for the Selection of a Response Surface
Design, Journal of the American Statistical Association, 54, 622–654.
Box, G. E. P., Hunter, W. G., and Hunter, J. S. (2005), Statistics for Experimenters:
Design, Innovation, and Discovery, 2nd Edition, New York: John Wiley & Sons.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004), Least Angle Regression,
Annals of Statistics, 32, 407–499.
Fedorov, V., and Hackl, P. (1997). Model-Oriented Design of Experiments. Lecture Notes
in Statistics, 125, New York: Springer-Verlag.
Hedayat, A. S., Sloane, N. J. A., and Stufken, J. (1999), Orthogonal Arrays: Theory and
Applications, New York: Springer-Verlag.
Jones, B., Lin, D. K. J., and Nachtsheim, C. J. (2008), Bayesian D-optimal Supersaturated
Designs, Journal of Statistical Planning and Inference, 138, 86–92.
Lin, D. K. J. (1993), A New Class of Supersaturated Designs, Technometrics, 35, 28–31.
Lin, C. D., Bingham, D., Sitter, R. R., and Tang, B. (2010), A New and Flexible Method for
Constructing Designs for Computer Experiments, Annals of Statistics, 38, 1460–1477.
Lin, C. D., Mukerjee, R., and Tang, B. (2009), Construction of Orthogonal and Nearly
Orthogonal Latin Hypercubes, Biometrika, 96, 243–247.
Lin, C. D., and Sitter, R. R. (2008), An Isomorphism Check for Two-level Fractional
Factorial Designs, Journal of Statistical Planning and Inference, 138, 1085–1101.
McKay, M. D., Beckman, R. J., and Conover, W. J. (1979), A Comparison of Three Methods
for Selecting Values of Input Variables in the Analysis of Output from a Computer
Code, Technometrics, 21, 239–245.
Meyer, R. D., Steinberg, D. M., and Box, G. E. P. (1996), Follow-up Designs to Resolve
Confounding in Multifactor Experiments, Technometrics, 38, 303–313.
Nguyen, N. (1996), A Note on Constructing Near-Orthogonal Arrays with Economic Run
Size, Technometrics, 38, 279–283.
Owen, A. B. (1992), A Central Limit Theorem for Latin Hypercube Sampling, Journal of
the Royal Statistical Society, Series B, 54, 541–551.
Pang, F., Liu, M. Q., and Lin, D. K. J. (2009), A Construction Method for Orthogonal
Latin Hypercube Designs with Prime Power Levels, Statistica Sinica, 19, 1721–1728.
R Development Core Team (2010), R: A Language and Environment for Statistical Computing,
Vienna: R Foundation for Statistical Computing.
Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989), Design and Analysis of
Computer Experiments, Statistical Science, 4, 409–423.
Steinberg, D. M., and Lin, D. K. J. (2006), A Construction Method for Orthogonal Latin
Hypercube Designs, Biometrika, 93, 279–288.
Sun, F., Liu, M. Q., and Lin, D. K. J. (2009), Construction of Orthogonal Latin Hypercube
Designs, Biometrika, 96, 971–974.
Sun, F., Liu, M. Q., and Lin, D. K. J. (2010), Construction of Orthogonal Latin Hypercube
Designs with Flexible Run Sizes, Journal of Statistical Planning and Inference, 140,
3236–3242.
Tibshirani, R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the
Royal Statistical Society, Series B, 58, 267–288.
Tibshirani, R. (2011), Regression Shrinkage and Selection via the Lasso: A Retrospective,
Journal of the Royal Statistical Society, Series B, 73, 273–282.
Wu, C. F. J. (1993), Construction of Supersaturated Designs through Partially Aliased
Interactions, Biometrika, 80, 661–669.
Wu, C. F. J., and Hamada, M. (2009), Experiments: Planning, Analysis, and Parameter
Design Optimization, 2nd Edition, New York: John Wiley & Sons.
Xing, D., Wan, H., and Zhu, Y. (2011), Optimal Supersaturated Design for Variable Selec-
tion Via Lasso, working paper, Purdue University.
Ye, K. Q. (1998), Orthogonal Column Latin Hypercubes and Their Application in Computer
Experiments, Journal of the American Statistical Association, 93, 1430–1439.
Yuan, M., Joseph, V. R., and Lin, Y. (2007), An Efficient Variable Selection Approach for
Analyzing Designed Experiments, Technometrics, 49, 430–439.