
Noname manuscript No. (will be inserted by the editor)

Structural Properties of Affine Sparsity Constraints

Hongbo Dong · Miju Ahn · Jong-Shi Pang

Received: date / Accepted: date

Abstract We introduce a new constraint system for sparse variable selection in statistical learning. Such a system arises when there are logical conditions on the sparsity of certain unknown model parameters that need to be incorporated into their selection process. Formally, extending a cardinality constraint, an affine sparsity constraint (ASC) is defined by a linear inequality with two sets of variables: one set of continuous variables and the other set represented by their nonzero patterns. This paper aims to study an ASC system consisting of finitely many affine sparsity constraints. We investigate a number of fundamental structural properties of the solution set of such a non-standard system of inequalities, including its closedness and the description of its closure, continuous approximations and their set convergence, and characterizations of its tangent cones for use in optimization. Based on the obtained structural properties of an ASC system, we investigate the convergence of B(ouligand) stationary solutions when the ASC is approximated by surrogates of the step $\ell_0$-function commonly employed in sparsity representation. Our study lays a solid mathematical foundation for solving optimization problems involving these affine sparsity constraints through their continuous approximations.

Keywords Sparsity systems · Set convergence · Nonconvex optimization · B(ouligand) stationary points

Mathematics Subject Classification (2000) 90C26 · 90C90 · 62J05

1 Introduction

Judicious variable selection is an important task in statistical modeling. Practitioners wish to explain the data in the simplest way, i.e., following the parsimony principle and omitting redundant/unnecessary predictor variables that may add noise in the prediction or estimation of the essential quantities. In addition, for highly ill-posed problems, or applications where model interpretability is a major concern, context-specific assumptions and domain knowledge may require regularization for stable estimation and development of an interpretable model.

A popular approach for simultaneous variable selection and parameter estimation is to solve an optimization problem with two criteria. One is a loss (or residual) function measuring "model accuracy", i.e., how well the model fits available data sets, while the other criterion is a "penalty function" aiming to reduce "model complexity", i.e., obeying the parsimony principle that simpler is better.

The research of the second and the third authors was partially supported by the U.S. National Science Foundation grant IIS-1632971.

Hongbo Dong, Department of Mathematics and Statistics, Washington State University. E-mail: [email protected]

Miju Ahn, Daniel J. Epstein Department of Industrial & Systems Engineering, University of Southern California. E-mail: [email protected]

Jong-Shi Pang, Daniel J. Epstein Department of Industrial & Systems Engineering, University of Southern California. E-mail: [email protected]


One complexity measure is the number of nonzero variables in the model; in this case, the goal of the penalty function is to promote the sparsity of the unknown parameters to be estimated. Early examples in the statistics literature include the lasso [31], the fused lasso [32], and covariance selection for Gaussian graphical models [2,16], all leading to convex optimization problems; see the monograph [18] for more variations of the lasso. Nonconvex penalty functions have also been proposed and studied, among which are the smoothly clipped absolute deviation (SCAD) [14], the minimax concave penalty (MCP) [35], and the capped-$\ell_1$ [36], to name a few. Such nonconvex methods have been shown to enjoy several theoretical and practical advantages over convex methods in increasingly many applied contexts. The recent paper [1] presents a unified formulation of many of the most commonly used nonconvex penalty functions employed in sparsity representation and provides references for their applications.

The present paper is motivated by the increasingly advanced modeling of the variable selection process. Namely, in many applications, logical conditions among the variables need to be considered, in order to ensure a meaningful/interpretable statistical model, or to exploit domain knowledge to increase model fidelity. Such logical conditions could be that certain variables are allowed to become active only if certain other variables are selected in the model [7], or that the selection must respect given groupings of the variables [19], for which there may be different application-dependent stipulations such as "within-group sparsity" or "groupwise sparsity". For a recent application of such a group selection problem for building an integrative model in genetic data analysis, see [27]. Current approaches in the statistics literature to the modeling of such logical conditions are mostly ad hoc and rely mainly on convex formulations for computational ease. As is well known in the integer programming literature [25], such modeling simplifications are fundamentally flawed and can easily yield inaccurate descriptions of the true logical connections among the model variables. In this paper, we propose a rigorous framework as an attempt to address such problems in a faithful way, by first modeling the logical conditions exactly and then understanding what needs to be done in order to make the resulting formulations computationally tractable for subsequent optimization and analysis, and ultimately, for statistical inference. To be specific, we propose to model the logical conditions on sparsity with a new type of constraint which we call affine sparsity constraints (ASC). Such constraints are derived from a linear system of inequalities where some

of the continuous variables are individually replaced by the $\ell_0$-function of a scalar variable:

$$|t|_0 \;\triangleq\; \begin{cases} 1 & \text{if } t \neq 0 \\ 0 & \text{otherwise.} \end{cases}$$

A most significant departure of the resulting ASC system from a standard system of linear inequalities in linear or (mixed) integer programming is that the former system involves the discontinuous $\ell_0$-function and thus the solution set may not be closed. This feature immediately challenges the optimization over such constraints and calls for suitable approximations by continuous functions. Inspired by the family of surrogate sparsity functions [1] employed to approximate the $\ell_0$-function, we are led to investigate the approximation of an ASC by replacing the binary indicator function by a (continuous) surrogate sparsity function. Understanding the convergence of such an approximated continuous system to the given discontinuous ASC system is a primary goal of this paper. Prior to addressing this convergence issue, we obtain a necessary and sufficient condition for the solution set of an ASC system to be closed, and derive an explicit expression of the closure of such a set in general. Overall, this paper is devoted to the study of some fundamental structural properties of affine sparsity constraints in preparation for subsequent research on solving optimization problems subject to these constraints.

This paper is organized as follows. In Section 2, we formally define affine sparsity constraints, discuss some of their elementary features, and describe the applied problems that they model. We address the closure properties of the solutions of these constraints in Section 3, giving necessary and sufficient conditions for the closedness of the solution set and two representations of its closure: one in terms of some auxiliary integer variables and the other in terms of the continuous relaxations of the latter variables, under an appropriate assumption. Section 4 investigates the approximation of an ASC by replacing the $\ell_0$-function by the family of univariate folded concave functions commonly employed in statistical learning and sparsity optimization, and presents set convergence results of the approximations. In Section 5, we derive representations of the tangent cones of the solution set of the ASC and its approximations, and present conditions under which the tangent cones of the approximate constraint sets can be characterized (algebraically) as a finite union of convex sets. These tangent results are then used in the last Section 6 to establish important convergence properties of the directional stationary solutions of approximations of an ASC-constrained optimization problem.

Notation. We define some notation used in this paper. Based on the binary univariate $\ell_0$-function, we define the multivariate $\ell_0$-function $\|\bullet\|_0 : \mathbb{R}^n \to \{0,1\}^n$ by $\|x\|_0 \triangleq (\,|x_i|_0\,)_{i=1}^n$ for any $x \in \mathbb{R}^n$. For a given subset $\mathcal{J}$ of $\{1,\cdots,n\}$ and a vector $x \in \mathbb{R}^n$, $x_{\mathcal{J}}$ denotes the sub-vector of $x$ with components indexed by the elements of $\mathcal{J}$. For a matrix $A \in \mathbb{R}^{m\times n}$ and the same subset $\mathcal{J}$, $A_{\bullet\mathcal{J}}$ denotes the columns of $A$ indexed by $\mathcal{J}$. A similar notation applies to the rows of $A$. The support of a vector $x$ is the index set of nonzero components of $x$ and is denoted $\mathrm{supp}(x)$. For any matrix $A \in \mathbb{R}^{m\times n}$, the matrices $A^+$ and $A^-$ are the entry-wise nonnegative and nonpositive parts of $A$, such that for all $i$ and $j$,

$$[A^+]_{ij} \triangleq \max(A_{ij},0) \quad\text{and}\quad [A^-]_{ij} \triangleq \max(-A_{ij},0). \tag{1}$$

Thus we have the decomposition $A = A^+ - A^-$, where $A^{\pm}$ are nonnegative matrices. A similar definition applies to a vector $x$. For any $x \in \mathbb{R}^n$, $\mathbb{B}_n(x,r)$ is an open ball (in a suitable norm) centered at $x$ with radius $r$. For any set $S \subseteq \mathbb{R}^n$, $\mathrm{cl}(S)$ denotes its closure. A vector of all ones of a given dimension is written as $\mathbf{1}$, with the dimension omitted in the notation.

2 ASC Systems: Introduction and Preliminary Discussion

Given a matrix $A \in \mathbb{R}^{m\times n}$ and a vector $b \in \mathbb{R}^m$, the ASC system is defined as the problem of finding a vector $\beta \in \mathbb{R}^n$ such that

$$A\,\|\beta\|_0 \le b \quad\Longleftrightarrow\quad \left\{\; \sum_{j=1}^n A_{ij}\,|\beta_j|_0 \;\le\; b_i, \ \text{ for all } i = 1,\cdots,m \;\right\}. \tag{2}$$

The (possibly empty) solution set of this system is denoted SOL-ASC$(A,b)$. Clearly, the system (2) can be written as:

$$A^+\,\|\beta\|_0 \;\le\; A^-\,\|\beta\|_0 + b.$$

A particularly important special case is when $A$ is a 0-1 matrix, i.e., all its entries are either 0 or 1, and $b$ is a positive integral vector. In this case, the constraint $\sum_{j=1}^n A_{ij}\,|\beta_j|_0 \le b_i$ becomes

$$\sum_{j\in\mathcal{I}_i} |\beta_j|_0 \;\le\; b_i, \quad\text{where } \mathcal{I}_i \triangleq \{\, j \mid A_{ij} \neq 0 \,\},$$

which is an upper cardinality constraint stipulating that for each $i = 1,\cdots,m$, the number of nonzero components $\beta_j$ for $j \in \mathcal{I}_i$ is no more than the given cardinality $b_i$ (assumed integer). Cardinality constraints of this type have been studied in [6,8,10,15,38] using various reformulations; applications of such constraints to sparse portfolio selection can be found in [6,8,9,11].

Needless to say, the ASC$(A,b)$ is closely related to the system of linear inequalities $Aw \le b$, $w \in \mathbb{R}^n$. Nevertheless, there are many important differences. One obvious difference is that the solution set of the former system is a cone, possibly with the origin omitted; i.e., the following clearly holds:

$$\beta \in \text{SOL-ASC}(A,b) \ \Rightarrow\ \tau\beta \in \text{SOL-ASC}(A,b), \quad \forall\ \text{scalars } \tau \neq 0;$$

thus, under no other restriction on $\beta$, SOL-ASC$(A,b)$ is always an unbounded set whenever it contains a nonzero vector. In contrast, a polyhedron, which is the solution set of a system of linear inequalities, does not have this scaling property in general. Further, an important feature that distinguishes the general ASC$(A,b)$, where $A$ can have positive and negative entries, from the special case of a cardinality constraint system, where $A$ is nonnegative, is that the solution set of the latter system must be closed (due to the lower semi-continuity of the $\ell_0$-function), while SOL-ASC$(A,b)$ is not necessarily so in the general case, as illustrated by the simple Example 1 below.

Example 1 Consider the simple case where $A = [\,1 \ \ -1\,]$ and $b = 0$, yielding the system $|\beta_1|_0 \le |\beta_2|_0$. It is not difficult to see that SOL-ASC$(A,b)$ is the entire plane $\mathbb{R}^2$ except for the two half $\beta_1$-axes; i.e., SOL-ASC$(A,b) = \{(\beta_1,\beta_2) \in \mathbb{R}^2 \mid \beta_2 \neq 0\} \cup \{(0,0)\}$, which is obviously not closed.
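
Numerically, the non-closedness manifests as a feasible sequence whose limit escapes the set; a tiny self-contained check (ours):

```python
import numpy as np

A = np.array([[1.0, -1.0]]); b = np.array([0.0])        # |beta_1|_0 <= |beta_2|_0
feasible = lambda beta: np.all(A @ (np.abs(beta) > 0).astype(float) <= b)

for k in [1, 10, 100, 1000]:
    print(k, bool(feasible(np.array([1.0, 1.0 / k]))))  # True for every k
print("limit:", bool(feasible(np.array([1.0, 0.0]))))   # False: the limit escapes the set
```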

As we shall see later, the non-closedness of SOL-ASC$(A,b)$ is due to the presence of some negative entries in the matrix $A$. An extreme case of this is when the entries of $A$ are all either 0 or $-1$ and $b$ is a negative integral vector. In this case, we obtain a lower cardinality constraint of the form:

$$\sum_{j\in\mathcal{I}_i} |\beta_j|_0 \;\ge\; |b_i|, \quad\text{where } \mathcal{I}_i \triangleq \{\, j \mid A_{ij} < 0 \,\},$$

which has minimally been studied in the literature to date. By imposing no sign restrictions on the matrix $A$, our treatment goes far beyond these special cases and accommodates recent interest in statistical variable selection subject to logical constraints. Another important difference between SOL-ASC and a polyhedron is their respective tangent cones; see Section 5 for details.


It is natural to consider an extension of the ASC by including continuous variables; specifically, let $b : \mathbb{R}^k \to \mathbb{R}^m$ be a given mapping and let $\Gamma$ be a closed convex set in $\mathbb{R}^k$. Defined by the triplet $(A,b,\Gamma)$, the extended ASC (xASC) system is the problem of finding a pair $(\beta,\gamma) \in \mathbb{R}^n \times \Gamma$ such that

$$A\,\|\beta\|_0 \;\le\; b(\gamma).$$

The (possibly empty) solution set of this system is denoted SOL-xASC$(A,b,\Gamma)$. Subsequently, we will discuss how results for the ASC system can be extended to the xASC system, and show how these extended systems can arise in the approximation of the ASC (see Subsection 4.2).

2.1 Source problems

In general, if $\mathcal{M} \subseteq \{0,1\}^n$ is a subset of binary vectors containing all admissible vectors of $\|\beta\|_0$, then $\|\beta\|_0 \in \mathcal{M}$ is equivalent to $\|\beta\|_0 \in \mathrm{conv}(\mathcal{M})$, which in principle can be formulated as an ASC system since $\mathrm{conv}(\mathcal{M})$ is a polytope. More specifically, logical conditions on the sparsity of the model variables can be modeled by affine constraints using the binary $\ell_0$-indicators of these variables. In what follows we present two models of statistical regression with logical conditions on the unknown parameters.

Hierarchical variable selection. Consider the following regression model with interaction terms [7]:

$$y \;=\; \sum_{i=1}^n \beta^{(1)}_i x_i \;+ \sum_{1\le i<j\le n} \beta^{(2)}_{ij}\, x_i\, x_j \;+\; \varepsilon, \tag{3}$$

where $x \in \mathbb{R}^n$ is the vector of model inputs, $y \in \mathbb{R}$ is the (univariate) model output, the $\beta^{(1)}_i$ and $\beta^{(2)}_{ij}$ are the unknown model parameters to be estimated, and $\varepsilon$ is the (random) error of the model. It is common practice in the variable selection process to maintain certain hierarchical conditions (also called "heredity constraints" or "marginality" in the literature [7,17,23]) between the coefficients of the linear terms, $\beta^{(1)}_i$, and those of the interaction terms, $\beta^{(2)}_{ij}$. There are two types of hierarchical conditions. The strong hierarchical condition means that an interaction term can be selected only if both of the linear terms are selected, i.e., $|\beta^{(2)}_{ij}|_0 \le \min\big(\,|\beta^{(1)}_i|_0,\, |\beta^{(1)}_j|_0\,\big)$ for any $i<j$, while the weak hierarchical condition means that an interaction term can be selected only if one of the corresponding linear terms is selected, i.e., $|\beta^{(2)}_{ij}|_0 \le |\beta^{(1)}_i|_0 + |\beta^{(1)}_j|_0$. Clearly, both conditions can be represented by linear inequalities in $\|\beta\|_0$, where $\beta$ is the concatenated vector of $\beta^{(1)}$ and $\beta^{(2)}$.

The hierarchical variable selection problem has received much attention in the statistics literature; for its treatment, various convex relaxations of the hierarchical conditions have been proposed. For instance, in [7, Remark 1], it was suggested to employ the linear constraints $|\beta^{(2)}_{ij}| \le \beta^{(1)}_{+,i} + \beta^{(1)}_{-,i}$ for all $i<j$ to model the constraint $|\beta^{(2)}_{ij}|_0 \le |\beta^{(1)}_i|_0$, where $\beta^{(1)}_{\pm,i}$ are the nonnegative and nonpositive parts of $\beta^{(1)}_i$; as such, the latter two nonnegative variables ought to satisfy the important complementarity constraint $\beta^{(1)}_{+,i}\,\beta^{(1)}_{-,i} = 0$. Nevertheless, the latter condition is dropped in the convex formulation. Note also the use of the absolute value $|\beta^{(2)}_{ij}|$ as a replacement for the $\ell_0$-function $|\beta^{(2)}_{ij}|_0$. These two convexification maneuvers lead to a restriction of the hierarchical constraints in the sense that:

$$|\beta^{(2)}_{ij}| \;\le\; \beta^{(1)}_{+,i} + \beta^{(1)}_{-,i} \ \Rightarrow\ |\beta^{(2)}_{ij}|_0 \;\le\; |\beta^{(1)}_i|_0 \;\le\; |\beta^{(1)}_i|_0 + |\beta^{(1)}_j|_0.$$

Obviously, such simplistic convexifications can at best be considered a very crude treatment of the nonconvex hierarchical constraints; they were introduced merely to take advantage of the advances of convex programming for the optimization of an objective function subject to these constraints. In light of the recent surge in interest in using nonconvex surrogates to replace the $\ell_0$-function, see e.g. [1] and the references therein, one is led to investigate the use of these surrogates in approximating the hierarchical constraints within the broader framework of an ASC.
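
For concreteness, the following sketch (ours; NumPy assumed) assembles the exact ASC matrix for the strong hierarchical conditions: each pair $i<j$ contributes the rows $|\beta^{(2)}_{ij}|_0 - |\beta^{(1)}_i|_0 \le 0$ and $|\beta^{(2)}_{ij}|_0 - |\beta^{(1)}_j|_0 \le 0$ over the concatenated nonzero pattern.

```python
import numpy as np
from itertools import combinations

def strong_hierarchy_asc(n):
    """Rows of A z <= 0 encoding |b2_ij|_0 <= min(|b1_i|_0, |b1_j|_0) for i < j.
    Variable order: beta^(1)_1..n first, then beta^(2)_ij for i < j."""
    pairs = list(combinations(range(n), 2))
    m, N = 2 * len(pairs), n + len(pairs)
    A = np.zeros((m, N))
    for p, (i, j) in enumerate(pairs):
        A[2 * p,     n + p] = 1.0; A[2 * p,     i] = -1.0   # z2_ij <= z1_i
        A[2 * p + 1, n + p] = 1.0; A[2 * p + 1, j] = -1.0   # z2_ij <= z1_j
    return A, np.zeros(m)

A, b = strong_hierarchy_asc(3)
print(A)   # 6 rows over (z1_1, z1_2, z1_3, z2_12, z2_13, z2_23)
```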

Several other works have developed specialized methods that attempt to induce the "hierarchical effects", based on convex penalty functions originally designed for grouped variable selection [34] (see the discussion of the next topic). For example, the paper [37] proposed a general framework for constructing (convex) penalty functions to derive the grouping or hierarchical effects. Specialized to the strong hierarchical case, their proposed penalty function is: for $\lambda > 0$,

$$\lambda \sum_{j<k} \left[\, |\beta^{(2)}_{jk}| + \left\| \big(\, \beta^{(1)}_j,\ \beta^{(1)}_k,\ \beta^{(2)}_{jk} \,\big) \right\|_{\gamma_{jk}} \right] \quad\text{where } 1 < \gamma_{jk},\ \forall\, j < k,$$

and $\|\bullet\|_{\gamma_{jk}}$ is the $\gamma_{jk}$-norm. Again, this is a very inaccurate way of expressing the strong hierarchical condition. Other related formulations can be found in [3], where combinations of convex functions are employed.

In summary, the existing attempts are commendable as a first step towards the treatment of the hierarchical constraints; obviously there is much room for further studies and improvement in the treatment. Our work is a step in the latter direction.

Group variable selection. In many applications, variables are naturally divided into groups. There are different versions of grouped variable selection, and a selective review is given in [19]; see also [20,34]. Here we consider two versions of this problem. One version is a variation of the group lasso discussed in [18, Section 4.3], where it is stated that it is desirable to have all coefficients within a group become zero (or nonzero) simultaneously. The formulation in the cited reference is as follows: instead of the basic linear regression model

$$y \;=\; \sum_{i=1}^n \beta_i\, x_i + \varepsilon, \tag{4}$$

that expresses the model output directly as a combination of the core predictors $\{x_i\}_{i=1}^n$, an aggregate model

$$y \;=\; \sum_{j=1}^J (z^j)^T \theta^j + \varepsilon'$$

in terms of the $J$ groups $\{\theta^j\}_{j=1}^J$ of (unknown) variates is postulated, where each $\theta^j \in \mathbb{R}^{p_j}$ represents a group of $p_j$ regression coefficients among the $\beta_i$'s, with the vectors $z^j$ being the known covariates in group $j$. A convex least-squares objective function [18, expression (4.5)] that contains the penalty $\sum_{j=1}^J \|\theta^j\|_2$ is minimized, whereby the unknown variables $\theta^j$ are obtained. While this provides a plausible approach to the group selection process, it does not exactly model the desired grouping as stipulated above with reference to the basic model (4), which is at the level of the individual predictors $x_i$, $i = 1,\cdots,n$. Instead, we may minimize a combined (Lagrangian) objective $f_\lambda(\beta) \triangleq \ell(\beta) + \lambda P(\beta)$, which comprises a loss function weighted by a sparsity penalty function, both in terms of the original variable $\beta \triangleq (\beta_i)_{i=1}^n$, subject to grouping conditions that can be formulated as follows. For each $j = 1,\cdots,J$, let $\mathcal{G}_j$ be the subset of $\{1,\cdots,n\}$ containing the indices $i$ such that the variable $\beta_i$ is in group $j$; thus $\{1,\cdots,n\} = \bigcup_{j=1}^J \mathcal{G}_j$. Consider the system with some auxiliary group variables $\zeta_j$:

$$|\beta_i|_0 \;=\; \zeta_j \quad \forall\, i \in \mathcal{G}_j \ \text{ and } \ j = 1,\cdots,J,$$

which can easily be seen to model exactly the desired grouping requirement. Clearly, the above is an xASC system in the pair of variables $(\beta,\zeta)$. Note that no constraints are imposed on $\zeta_j$. When the $\ell_0$-function is subsequently approximated by a surrogate function, properties of such a function (e.g., nonnegativity and upper bounds) will naturally transfer to restrictions on $\zeta_j$.

Consider the alternative stipulation of choosing the variables $\beta_i$ so that the number of groups covering all nonzero components is minimal. Together with the Lagrangian function $f_\lambda(\beta) \triangleq \ell(\beta) + \lambda P(\beta)$, the optimization problem of variable selection is: given positive coefficients $\{c_j\}_{j=1}^J$,

$$\begin{array}{ll}
\displaystyle\operatorname*{minimize}_{\beta,\,\zeta} & f_\lambda(\beta) + \displaystyle\sum_{j=1}^J c_j\,\zeta_j \\[6pt]
\text{subject to} & |\beta_i|_0 \;\le \displaystyle\sum_{j\,:\,i\in\mathcal{G}_j} \zeta_j, \quad \forall\, i = 1,\cdots,n, \\[6pt]
 & \zeta_j \in \{0,1\}, \quad j = 1,\cdots,J,
\end{array} \tag{5}$$

where each xASC models the coverage of the variate $\beta_i$ by the groups that contain it; i.e., the variable $\beta_i$ is selected only if at least one of the groups containing predictor $i$ is selected. The minimization of the (weighted) sum of the binary variables $\zeta_j$ is a slight generalization of the goal of selecting the minimum number of groups in the coverage of all the nonzero $\beta_i$'s, which is the special case with equal weights. Clearly, the relaxation of the binary condition $\zeta_j \in \{0,1\}$ to the continuous condition $\zeta_j \in [0,1]$ leads to an xASC system in the pair of variables $(\beta,\zeta)$.
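
A small sketch (ours; NumPy assumed) assembling the coverage constraints of (5) for given groups; row $i$ encodes $|\beta_i|_0 - \sum_{j:\,i\in\mathcal{G}_j}\zeta_j \le 0$ over the joint pattern vector $(\|\beta\|_0,\zeta)$:

```python
import numpy as np

def coverage_asc(groups, n):
    """Rows of A [z; zeta] <= 0 for |beta_i|_0 <= sum_{j: i in G_j} zeta_j."""
    J = len(groups)
    A = np.zeros((n, n + J))
    for i in range(n):
        A[i, i] = 1.0                       # coefficient of |beta_i|_0
        for j, G in enumerate(groups):
            if i in G:
                A[i, n + j] = -1.0          # coefficient of zeta_j
    return A, np.zeros(n)

groups = [{4}, {0, 1}, {2, 3}, {4, 5}, {2, 3, 4, 5}]   # the tree example below, 0-based
A, b = coverage_asc(groups, 6)
print(A.shape)   # (6, 11): 6 coverage rows over (z_1..z_6, zeta_1..zeta_5)
```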

A special case: Tree structure. In what follows, we discuss a special case of the group selection problem that justifies the relaxation of the 0-1 restriction on the group indicator variables $\zeta_j$ in the problem (5). The key to this justification is the fact that, for fixed $\beta$, the constraints of the problem are those of a well-known set covering problem [25, Part III]. As such, the theory of the latter problem can be applied. In turn, this application is based on the theory of balanced matrices, which we quickly review; see [12,13]. A 0-1 matrix is a balanced matrix if and only if for any odd number $k \ge 3$, it does not contain a $k\times k$ submatrix whose row sums and column sums all equal 2 and which does not contain the $2\times 2$ submatrix $\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$. A 0-1 matrix is called totally balanced (TB) if this condition holds for all $k \ge 3$ (not just odd $k$). A sufficient condition for a matrix to be TB is that it does not contain a submatrix $F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$ [25, Part III, Proposition 4.5]. (In fact, this condition is also necessary up to permutations of rows and columns [25, Part III, Proposition 4.8].) Clearly, the TB property is invariant under permutations of rows and columns.

The model discussed below is related to the tree structured group lasso presented in [33]. Suppose that any two distinct members of the family $\mathcal{G} \triangleq \{\mathcal{G}_j\}_{j=1}^J$ are such that either they do not intersect or one is a proper subset of the other. In this case (which clearly includes the special case of non-overlapping groups, i.e., $\mathcal{G}_i \cap \mathcal{G}_j = \emptyset$ for $i \neq j$), we can arrange the groups in a tree structure as follows. Let $d \triangleq \max_{1\le j\le J} |\mathcal{G}_j|$ be the largest number of elements contained in the individual groups. For each integer $k = 1,\cdots,d$, let $T_k = \{\mathcal{G}^k_1,\cdots,\mathcal{G}^k_{q_k}\}$ be the sub-family of $\mathcal{G}$ consisting of all (distinct) groups each with exactly $k$ variables; thus each $\mathcal{G}^k_r$ is one of the $\mathcal{G}_j$. For simplicity, assume that each $T_k$ is a non-empty sub-family for $k = 1,\cdots,d$; thus $J = \sum_{k=1}^d q_k$. It then follows that any two different members of $T_k$ do not overlap; i.e., for any $k = 1,\cdots,d$ and $r \neq r'$ both in $\{1,\cdots,q_k\}$, $\mathcal{G}^k_r \cap \mathcal{G}^k_{r'} = \emptyset$. Furthermore, for each pair $(k,k')$ satisfying $1 \le k' < k \le d$, we call $\mathcal{G}^k_r$ a parent node of $\mathcal{G}^{k'}_{r'}$ if $\mathcal{G}^{k'}_{r'}$ is a proper subset of $\mathcal{G}^k_r$. Thus a parent node contains more elements than its descendent node(s). Note that it is not necessary for every element in the sub-family $T_k$ to have a parent node in the sub-family $T_{k'}$, nor is it necessary for every element in $T_{k'}$ to have a parent node in $T_k$. We define the element-group incidence matrix $E \in \mathbb{R}^{n\times J}$ as follows. Arrange the columns of $E$ in the order of ascendancy of the groups in $T_k$ for $k = 1,\cdots,d$; i.e.,

$$\underbrace{\mathcal{G}^1_1,\cdots,\mathcal{G}^1_{q_1}}_{T_1};\ \underbrace{\mathcal{G}^2_1,\cdots,\mathcal{G}^2_{q_2}}_{T_2};\ \cdots;\ \underbrace{\mathcal{G}^{d-1}_1,\cdots,\mathcal{G}^{d-1}_{q_{d-1}}}_{T_{d-1}};\ \underbrace{\mathcal{G}^d_1,\cdots,\mathcal{G}^d_{q_d}}_{T_d};$$

then let $E_{ij} \triangleq \begin{cases} 1 & \text{if } i \text{ is contained in group } j \\ 0 & \text{otherwise.} \end{cases}$ In terms of this matrix $E$, the constraints

$$\sum_{j\,:\,i\in\mathcal{G}_j} \zeta_j \;\ge\; |\beta_i|_0 \quad \forall\, i \in \mathrm{supp}(\beta)$$

can be written simply as $E_{S\bullet}\,\zeta \ge \mathbf{1}_{|S|}$, where $S \triangleq \mathrm{supp}(\beta)$ and $E_{S\bullet}$ denotes the rows of $E$ indexed by $S$.

[Figure: the tree structure of the groups, with levels $T_3 = \{\beta_3,\beta_4,\beta_5,\beta_6\}$; $T_2 = \{\beta_1,\beta_2\},\ \{\beta_3,\beta_4\},\ \{\beta_5,\beta_6\}$; $T_1 = \{\beta_5\}$, together with the element-group incidence matrix]

$$E \;=\; \begin{array}{c|ccccc}
 & \mathcal{G}_1 & \mathcal{G}_2 & \mathcal{G}_3 & \mathcal{G}_4 & \mathcal{G}_5 \\ \hline
\beta_1 & 0 & 1 & 0 & 0 & 0 \\
\beta_2 & 0 & 1 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 1 & 0 & 1 \\
\beta_4 & 0 & 0 & 1 & 0 & 1 \\
\beta_5 & 1 & 0 & 0 & 1 & 1 \\
\beta_6 & 0 & 0 & 0 & 1 & 1
\end{array}$$

An illustration of the tree structure and the element-group incidence matrix $E$: 6 variables, 3 levels, and 5 groups: $\mathcal{G}_1 = \{\beta_5\}$, $\mathcal{G}_2 = \{\beta_1,\beta_2\}$, $\mathcal{G}_3 = \{\beta_3,\beta_4\}$, $\mathcal{G}_4 = \{\beta_5,\beta_6\}$, and $\mathcal{G}_5 = \{\beta_3,\beta_4,\beta_5,\beta_6\}$.

Proposition 1 Let the family of groups $\{\mathcal{G}_j\}_{j=1}^J$ have the tree structure defined above. For any $\beta \in \mathbb{R}^n$ and any nonnegative coefficients $\{c_j\}_{j=1}^J$, it holds that

$$\min\left\{ \sum_{j=1}^J c_j\,\zeta_j \;\middle|\; \zeta \in \{0,1\}^J,\ \sum_{j\,:\,i\in\mathcal{G}_j} \zeta_j \ge 1,\ \forall\, i \in \mathrm{supp}(\beta) \right\} \;=\; \min\left\{ \sum_{j=1}^J c_j\,\zeta_j \;\middle|\; \zeta \in [0,1]^J,\ \sum_{j\,:\,i\in\mathcal{G}_j} \zeta_j \ge 1,\ \forall\, i \in \mathrm{supp}(\beta) \right\}.$$

Proof. Let $\beta$ be given. Suppose that the matrix $E_{S\bullet}$ has a submatrix $F \triangleq \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}$ corresponding to rows $i < i'$ and columns $j < j'$. Then $i$ is contained in both groups corresponding to columns $j$ and $j'$. By the tree structure assumption and the ordering of the columns of $E$, this can only happen if the group at column $j$ is a descendent node of the group at column $j'$. However, since $E_{i'j} = 1$ and $E_{i'j'} = 0$, element $i'$ is in the descendent group but not in the parent group, hence a contradiction. Thus the constraint matrix $E_{S\bullet}$ is TB. By [25, Part III, Theorem 4.13], the polyhedron $\{\zeta \ge 0 \mid E_{S\bullet}\,\zeta \ge \mathbf{1}_{|S|}\}$ is integral; i.e., all its extreme points are binary. This is enough to establish the desired equality of the two minima.
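
As a numerical sanity check of Proposition 1, the sketch below (ours; SciPy assumed) solves the LP relaxation of the covering problem for the 6-variable example above; by the total balancedness argument, the optimal vertex returned is expected to be binary.

```python
import numpy as np
from scipy.optimize import linprog

# Element-group incidence matrix E of the 6-variable, 5-group example.
E = np.array([[0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)
S = [0, 2, 4]                 # supp(beta) = {beta_1, beta_3, beta_5} (0-based rows)
c = np.ones(5)                # equal group weights c_j = 1

# LP relaxation: min c^T zeta  s.t.  E_S. zeta >= 1,  0 <= zeta <= 1.
res = linprog(c, A_ub=-E[S, :], b_ub=-np.ones(len(S)), bounds=[(0, 1)] * 5)
print(res.x)                  # expected binary vertex [0, 1, 0, 0, 1]: select G2 and G5
```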

3 Closedness and Closure

The closedness of a feasible region is very important in constrained optimization because the minimum value of a lower semi-continuous objective function may not be attained on a non-closed feasible region, even if the objective has bounded level sets in the region. This section addresses the closedness issue of an (x)ASC system and derives an expression of the closure of SOL-(x)ASC.

3.1 Necessary and sufficient conditions for closedness

It is convenient to define two related sets of binary vectors which capture the "combinatorial structure" of SOL-ASC$(A,b)$:

$$\mathcal{Z}(A,b) \;\triangleq\; \{\, z \in \{0,1\}^n \mid Az \le b \,\} \quad\text{and}\quad \mathcal{Z}^{\le}(A,b) \;\triangleq\; \{\, z \in \{0,1\}^n \mid \exists\, \bar{z} \in \{0,1\}^n \text{ such that } A\bar{z} \le b \text{ and } z \le \bar{z} \,\}. \tag{6}$$

The latter set $\mathcal{Z}^{\le}(A,b)$ contains binary vectors obtained by "zeroing out" some entries of the vectors in $\mathcal{Z}(A,b)$. These two sets provide the intermediate step to characterize the closedness of SOL-ASC$(A,b)$. The following proposition states the relationship between $\mathcal{Z}(A,b)$ and SOL-ASC$(A,b)$. No proof is needed.

Proposition 2 It holds that

$$\{0,1\}^n \cap \text{SOL-ASC}(A,b) \;=\; \mathcal{Z}(A,b) \;\subseteq\; \text{SOL-ASC}(A,b) \;=\; \underbrace{\{\,\beta \mid \|\beta\|_0 \in \mathcal{Z}(A,b)\,\}}_{\text{to be extended to the closure of SOL-ASC}(A,b)} \;=\; \bigcup_{d\,:\,d_i\neq 0,\ \forall i} \{\, d \circ z \mid z \in \mathcal{Z}(A,b) \,\},$$

where $d \circ z$ denotes the componentwise product. Moreover, the inclusion is strict if $\mathcal{Z}(A,b)$ is neither the empty set nor the singleton $\{0\}$.

Clearly, $\mathcal{Z}(A,b)$ is always closed and bounded, while SOL-ASC$(A,b)$ need not be closed and must be unbounded unless it is the singleton $\{0\}$. The following result provides sufficient and necessary conditions for SOL-ASC$(A,b)$ to be a closed set.

Proposition 3 Suppose SOL-ASC$(A,b) \neq \emptyset$. The following three statements are equivalent.

(a) SOL-ASC$(A,b)$ is closed.
(b) $\mathcal{Z}(A,b) = \mathcal{Z}^{\le}(A,b)$.
(c) SOL-ASC$(A,b) = \{\beta \mid A^+\|\beta\|_0 \le b\}$ and $\mathcal{Z}(A,b) = \{z \in \{0,1\}^n \mid A^+ z \le b\}$.

Proof. (a) $\Rightarrow$ (b). It suffices to prove $\mathcal{Z}^{\le}(A,b) \subseteq \mathcal{Z}(A,b)$. By way of contradiction, suppose there exist binary vectors $z \le \bar{z}$ with $\bar{z} \in \mathcal{Z}(A,b) \not\ni z$. Clearly $\mathrm{supp}(z) \subsetneq \mathrm{supp}(\bar{z})$. Define the vector $\beta^k$ with components

$$\beta^k_i \;\triangleq\; \begin{cases} 1/k & \text{if } i \in \mathrm{supp}(\bar{z}) \setminus \mathrm{supp}(z) \\ z_i & \text{otherwise.} \end{cases}$$

It then follows that $\|\beta^k\|_0 = \bar{z} \in \mathcal{Z}(A,b) \subseteq \text{SOL-ASC}(A,b)$. Moreover, $\lim_{k\to\infty} \beta^k = z \notin \mathcal{Z}(A,b)$. This contradicts the closedness of SOL-ASC$(A,b)$.

(b) $\Rightarrow$ (c). It suffices to show the inclusion $\mathcal{Z}(A,b) \subseteq \{z \in \{0,1\}^n \mid A^+ z \le b\}$. Let $z \in \mathcal{Z}(A,b)$. We claim that $\sum_{j=1}^n \max(A_{ij},0)\, z_j \le b_i$ for arbitrary $i = 1,\cdots,m$. For a given $i$, define a vector $\bar{z}$ such that

$$\bar{z}_j \;\triangleq\; \begin{cases} z_j & \text{if } A_{ij} \ge 0 \\ 0 & \text{if } A_{ij} < 0, \end{cases}$$

which clearly satisfies $\bar{z} \le z$. By (b), it follows that $\bar{z} \in \mathcal{Z}(A,b)$. We have

$$b_i \;\ge\; \sum_{j=1}^n \max(A_{ij},0)\,\bar{z}_j - \sum_{j=1}^n \max(-A_{ij},0)\,\bar{z}_j \;=\; \sum_{j=1}^n \max(A_{ij},0)\, z_j.$$

Thus the claim holds.

(c) $\Rightarrow$ (a). This is obvious because the mapping $\|\bullet\|_0$ is lower semi-continuous and $A^+$ contains only nonnegative entries.
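
Since both sets in (6) are finite, criterion (b) of Proposition 3 yields a brute-force closedness test for small $n$; a sketch (ours) applied to the data of Example 1:

```python
import numpy as np
from itertools import product

def Z(A, b):
    """All binary z with A z <= b."""
    n = A.shape[1]
    return {z for z in product((0, 1), repeat=n)
            if np.all(A @ np.array(z) <= b + 1e-12)}

def Z_leq(A, b):
    """All binary z dominated componentwise by some zbar in Z(A,b)."""
    n = A.shape[1]
    zs = Z(A, b)
    return {z for z in product((0, 1), repeat=n)
            if any(all(zi <= zbi for zi, zbi in zip(z, zb)) for zb in zs)}

A = np.array([[1.0, -1.0]]); b = np.array([0.0])   # Example 1
print(Z(A, b) == Z_leq(A, b))   # False -> SOL-ASC(A,b) is not closed
```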

Remark 1 It is possible for SOL-ASC$(A,b)$ to be a closed set when $A$ contains negative entries. A trivial example is the set $\{\beta \mid -\|\beta\|_0 \le 0\} = \mathbb{R}^n$, which corresponds to $A$ being the negative identity matrix and $b = 0$.

Example 2 We illustrate the two sets $\mathcal{Z}(A,b)$ and $\mathcal{Z}^{\le}(A,b)$ using the regression model (3) with interaction terms that satisfy the strong hierarchical conditions $|\beta^{(2)}_{ij}|_0 \le \min\big(\,|\beta^{(1)}_i|_0,\, |\beta^{(1)}_j|_0\,\big)$ for all $i<j$ and also a cardinality constraint on $\beta^{(1)}$. In this case, we have, for some integer $K > 0$,

$$\mathcal{Z}(A,b) \;=\; \left\{ (z^{(1)}, z^{(2)}) \;\middle|\; z^{(2)}_{ij} \le \min\big(z^{(1)}_i,\, z^{(1)}_j\big),\ \forall\, i<j;\ \sum_{i=1}^n z^{(1)}_i \le K;\ z^{(2)}_{ij},\, z^{(1)}_i \text{ all binary} \right\}.$$

We claim that

$$\mathcal{Z}^{\le}(A,b) \;=\; \left\{ (z^{(1)}, z^{(2)}) \;\middle|\; \sum_{i=1}^n \max_{j<i<\ell} \big( z^{(1)}_i,\, z^{(2)}_{ji},\, z^{(2)}_{i\ell} \big) \le K,\ z^{(2)}_{ij},\, z^{(1)}_i \text{ all binary} \right\}. \tag{7}$$

To see this, let $(z^{(1)}, z^{(2)})$ be a pair belonging to the right-hand set. Define

$$\bar{z}^{(1)}_i \;\triangleq\; \max_{j<i<\ell} \big( z^{(1)}_i,\, z^{(2)}_{ji},\, z^{(2)}_{i\ell} \big).$$

It is not difficult to see that $(\bar{z}^{(1)}, z^{(2)}) \in \mathcal{Z}(A,b)$. Conversely, let $(z^{(1)}, z^{(2)})$ be a binary pair such that there exists $(\bar{z}^{(1)}, \bar{z}^{(2)}) \in \mathcal{Z}(A,b)$ with $(z^{(1)}, z^{(2)}) \le (\bar{z}^{(1)}, \bar{z}^{(2)})$. Then,

$$\sum_{i=1}^n \max_{j<i<\ell} \big( z^{(1)}_i,\, z^{(2)}_{ji},\, z^{(2)}_{i\ell} \big) \;\le\; \sum_{i=1}^n \max_{j<i<\ell} \big( \bar{z}^{(1)}_i,\, \bar{z}^{(2)}_{ji},\, \bar{z}^{(2)}_{i\ell} \big) \;=\; \sum_{i=1}^n \bar{z}^{(1)}_i \;\le\; K.$$

This proves that $(z^{(1)}, z^{(2)})$ belongs to the right-hand set in (7). Hence the equality in this expression holds.


3.2 Closure of SOL-ASC

The expression (7) is interesting because the result below shows that the set $\mathcal{Z}^{\le}(A,b)$ determines the closure of the solution set of the ASC system, not only for the regression model with interaction terms under the strong hierarchical relation among its variates, but also in general.

Proposition 4 It holds that

$$\mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,] \;=\; \big\{\,\beta \mid \|\beta\|_0 \in \mathcal{Z}^{\le}(A,b)\,\big\}.$$

Proof. Let $\{\beta^k\}$ be a sequence of vectors in SOL-ASC$(A,b)$ converging to the limit $\beta^\infty$. We then have $z^k \triangleq \|\beta^k\|_0 \in \mathcal{Z}(A,b)$. Since $\mathcal{Z}(A,b)$ is a compact set, we may assume without loss of generality that the binary sequence $\{z^k\}$ converges to a binary vector $z^\infty$ that must belong to $\mathcal{Z}(A,b)$. We then have

$$|\beta^\infty_j|_0 \;\le\; \liminf_{k\to\infty} |\beta^k_j|_0 \;=\; z^\infty_j, \quad \forall\, j = 1,\cdots,n.$$

By the definition of $\mathcal{Z}^{\le}(A,b)$, it follows that $\|\beta^\infty\|_0 \in \mathcal{Z}^{\le}(A,b)$. Thus

$$\mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,] \;\subseteq\; \big\{\,\beta \mid \|\beta\|_0 \in \mathcal{Z}^{\le}(A,b)\,\big\}.$$

To show the reverse inclusion, let $\beta$ be such that $\|\beta\|_0 \le \bar{z}$ for some $\bar{z} \in \mathcal{Z}(A,b)$. The sequence $\{\beta^k\}$, where

$$\beta^k_j \;\triangleq\; \begin{cases} \beta_j & \text{if } |\beta_j|_0 = \bar{z}_j \\ 1/k & \text{otherwise,} \end{cases} \quad j = 1,\cdots,n,$$

converges to $\beta$. Moreover, since $\|\beta^k\|_0 = \bar{z}$, it follows that $\beta^k \in \text{SOL-ASC}(A,b)$ for all $k$. So we have $\beta \in \mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,]$, as desired.

Propositions 2 and 4 have established the fundamental role the two sets $\mathcal{Z}(A,b)$ and $\mathcal{Z}^{\le}(A,b)$ play in the study of the ASC$(A,b)$. The former set $\mathcal{Z}(A,b)$ is the intersection of the polyhedron $\{z \mid Az \le b\}$ with the set $\{0,1\}^n$ of binary vectors, while a continuous relaxation of the latter set $\mathcal{Z}^{\le}(A,b)$ is

$$\mathcal{H}^{\le}(A,b) \;\triangleq\; \{\, z \in [0,1]^n \mid \exists\, \bar{z} \in [0,1]^n \text{ such that } A\bar{z} \le b \text{ and } z \le \bar{z} \,\}.$$

Clearly $\mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,] \subseteq \big\{\beta \mid \|\beta\|_0 \in \mathcal{H}^{\le}(A,b)\big\}$. A natural question is whether equality holds. An affirmative answer to this question will simplify the task of verifying whether a given vector $\beta$ belongs to the left-hand closure, and facilitate the optimization over this closure. Indeed, according to Proposition 4, the former task can be accomplished by solving an integer program. If the equality in question holds, then this task amounts to solving a linear program. Furthermore, a representation in terms of the polytope $\mathcal{H}^{\le}(A,b)$ is key to the convergence of the approximation of the $\ell_0$-function by continuous surrogate functions.

Before addressing the above question, we give an example to show that the desired equality of the two sets in question does not always hold without conditions.

Example 3 Consider the ASC system in two variables $(\beta_1,\beta_2)$: $|\beta_1|_0 \le 0.8$ and $|\beta_2|_0 \le 2\,|\beta_1|_0$. Clearly, the only solution is $(0,0)$. The set $\mathcal{H}^{\le}(A,b)$ for this system is

$$\big\{ (z_1,z_2) \in [0,1]^2 \mid \exists\, (\bar{z}_1,\bar{z}_2) \in [0,1]^2 \text{ such that } \bar{z}_1 \le 0.8,\ \bar{z}_2 \le 2\bar{z}_1,\ \text{and } (z_1,z_2) \le (\bar{z}_1,\bar{z}_2) \big\},$$

which clearly contains the point $(0,1)$. Hence $\big\{\beta \mid \|\beta\|_0 \in \mathcal{H}^{\le}(A,b)\big\}$ is a proper superset of $\mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,]$.

Roughly speaking, the condition below has to do with the rounding of the elements on certain faces of $\mathcal{H}^{\le}(A,b)$ to integers without violating the linear system $Az \le b$.

Assumption A. For any subset $\mathcal{I}$ of $\{1,\cdots,n\}$,

$$\big\{ z \in [0,1]^n \mid Az \le b \text{ and } z_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|} \big\} \neq \emptyset \ \Rightarrow\ \mathcal{Z}(A,b) \cap \big\{ z \mid z_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|} \big\} \neq \emptyset.$$

It is not difficult to see that Example 3 fails this assumption with $\mathcal{I} = \{2\}$, because the set $\big\{(z_1,z_2) \in [0,1]^2 \mid z_1 \le 0.8,\ z_2 \le 2z_1,\ \text{and } z_2 = 1\big\}$ is nonempty, but the constraints have no solutions with $z_1 \in \{0,1\}$.


Proposition 5 The following two statements hold:

(a) Under Assumption A, $\mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,] = \big\{\beta \mid \|\beta\|_0 \in \mathcal{H}^{\le}(A,b)\big\}$.
(b) If Assumption A is violated by the index set $\mathcal{I}$, then any vector $\beta$ with $\mathrm{supp}(\beta) = \mathcal{I}$ belongs to the right-hand but not the left-hand set in (a).

Proof. It suffices to show that if $\beta$ is such that $\|\beta\|_0 \le z$ for some $z \in [0,1]^n$ satisfying $Az \le b$, then $\beta$ belongs to the closure of SOL-ASC$(A,b)$. This follows readily by applying Assumption A to the index set $\mathcal{I} = \mathrm{supp}(\beta)$ (note that $z_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$ because $\|\beta\|_0 \le z$), which produces a binary $\bar{z} \in \mathcal{Z}(A,b)$ with $\bar{z} \ge \|\beta\|_0$; Proposition 4 then applies. Thus (a) holds.

To prove (b), suppose $\mathcal{I}$ violates Assumption A. Let $z \in [0,1]^n$ be such that $Az \le b$ and $z_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$, but there does not exist $\bar{z} \in \mathcal{Z}(A,b)$ with $\bar{z}_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$. Clearly $z \ge \|\beta\|_0$. Hence $\|\beta\|_0 \in \mathcal{H}^{\le}(A,b)$. But $\|\beta\|_0 \notin \mathcal{Z}^{\le}(A,b)$; for otherwise, there would exist $\bar{z} \in \mathcal{Z}(A,b)$ satisfying $\bar{z} \ge \|\beta\|_0$, which implies $\bar{z}_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$, a contradiction. By Proposition 4, it follows that $\beta \notin \mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,]$.

Assumption A holds for a matrix $A$ satisfying the column-wise uni-sign property, i.e., when each column of $A$ has either all nonpositive or all nonnegative entries (in particular, when $A$ has only one row). In this case, fractional components (other than those in a given set $\mathcal{I}$) of a vector $z \in [0,1]^n$ satisfying $Az \le b$ can be rounded either up or down, depending on the sign of the elements in the corresponding column, without violating the inequalities $Az \le b$. This special sign property of $A$ turns out to be important in optimization subject to ASCs; see Subsection 6.2. Another case where Assumption A holds is if

$$\text{convex hull of } \mathcal{Z}(A,b) \;=\; \{\, z \in [0,1]^n \mid Az \le b \,\}. \tag{8}$$

Indeed, if the above equality holds, and if $z \in [0,1]^n$ satisfying $Az \le b$ is such that $z_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$, then there exist $\{z^k\}_{k=1}^K \subseteq \mathcal{Z}(A,b)$ and positive scalars $\{\lambda_k\}_{k=1}^K$ summing to unity such that $z = \sum_{k=1}^K \lambda_k z^k$. Since $z_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$ and each $z^k$ is a binary vector, we must have $z^k_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$ for all $k = 1,\cdots,K$. Thus Assumption A is valid under (8). Part (b) of Proposition 5 shows that Assumption A is sharp for part (a) to be valid.

Remark 2 Stand-alone lower cardinality constraints are not particularly interesting, as the closure of such a constraint is the entire space. This may be one reason why lower cardinality constraints have not been found useful in an optimization context. However, when there are multiple ASCs, some of which may be of the lower cardinality kind, the closure operation must be performed for the whole system (instead of the individual constraints). The results in this section provide proper tools for this aim. Consider the ASC system in hierarchical variable selection for example,

$$\text{SOL-ASC}(A,b) \;=\; \{\,\beta \mid \|\beta\|_0 \in \mathcal{Z}(A,b)\,\},$$

where $\mathcal{Z}(A,b)$ is defined as in Example 2. Each constraint of the form $|\beta_{ij}|_0 \le \min\big(|\beta_i|_0,\, |\beta_j|_0\big)$ defines a set whose closure is the whole space. If one were to perform the closure operation on each constraint individually and then take the intersection, one would end up with $\left\{ (\beta^{(1)},\beta^{(2)}) \;\middle|\; \sum_{j=1}^n |\beta^{(1)}_j|_0 \le K \right\}$. This set loses all hierarchical information and is a proper superset of the actual closure: $\mathrm{cl}\,[\,\text{SOL-ASC}(A,b)\,] = \big\{\beta \mid (\|\beta^{(1)}\|_0,\|\beta^{(2)}\|_0) \in \mathcal{Z}^{\le}(A,b)\big\}$, with $\mathcal{Z}^{\le}(A,b)$ characterized in (7).

3.3 Extension to SOL-xASC

In what follows, we extend two main results of the last subsection to an xASC system. In both extensions, we assume that $b(\gamma)$ is a continuous function and $\Gamma$ is a closed set. We first extend Proposition 3.

Proposition 6 The set SOL-xASC$(A,b,\Gamma)$ is closed if and only if

$$\text{SOL-xASC}(A,b,\Gamma) \;=\; \big\{ (\beta,\gamma) \in \mathbb{R}^n \times \Gamma \mid A^+\|\beta\|_0 \le b(\gamma) \big\}.$$

Proof. It suffices to prove that if SOL-xASC$(A,b,\Gamma)$ is closed, then it is contained in $\{(\beta,\gamma) \mid A^+\|\beta\|_0 \le b(\gamma)\}$. Let $(\bar\beta,\bar\gamma) \in \text{SOL-xASC}(A,b,\Gamma)$. Then $\bar\beta \in \text{SOL-ASC}(A,b(\bar\gamma))$, which must be closed. Hence $A^+\|\bar\beta\|_0 \le b(\bar\gamma)$ by Proposition 3.

Next is an extension of part (a) of Proposition 5. Part (b) is similar and omitted.


Proposition 7 Suppose that for every $\gamma \in \Gamma$, Assumption A holds for the pair $(A,b(\gamma))$. Then

$$\mathrm{cl}\,[\,\text{SOL-xASC}(A,b,\Gamma)\,] \;=\; \big\{ (\beta,\gamma) \in \mathbb{R}^n \times \Gamma \mid \|\beta\|_0 \in \mathcal{H}^{\le}(A,b(\gamma)) \big\}. \tag{9}$$

Proof. Take any $(\bar\beta,\bar\gamma)$ in the left-hand closure in (9). We claim that for every $\varepsilon > 0$, $\bar\beta$ belongs to the closure of the set $\{\beta \mid A\|\beta\|_0 \le b(\bar\gamma) + \varepsilon\mathbf{1}\}$, which must be a subset of $\big\{\beta \mid \|\beta\|_0 \in \mathcal{H}^{\le}(A,b_\varepsilon)\big\}$, where $b_\varepsilon \triangleq b(\bar\gamma) + \varepsilon\mathbf{1}$. Let $\{(\beta^\nu,\gamma^\nu)\}$ be a sequence in SOL-xASC$(A,b,\Gamma)$ that converges to $(\bar\beta,\bar\gamma)$ as $\nu\to\infty$. By definition, we have $A\|\beta^\nu\|_0 \le b(\gamma^\nu)$. Since $b(\gamma^\nu)$ converges to $b(\bar\gamma)$, for any $\varepsilon > 0$ there exists $\bar\nu$ such that for all $\nu \ge \bar\nu$, $A\|\beta^\nu\|_0 \le b(\bar\gamma) + \varepsilon\mathbf{1}$. Since $\bar\beta$ is the limit of $\{\beta^\nu\}$, the claim holds. Hence, for every $\varepsilon > 0$ there exists $z_\varepsilon \in [0,1]^n$ such that

$$A z_\varepsilon \;\le\; b(\bar\gamma) + \varepsilon\mathbf{1} \quad\text{and}\quad \|\bar\beta\|_0 \;\le\; z_\varepsilon.$$

It then follows by a continuity property of polyhedra that there exists $z \in [0,1]^n$ such that $Az \le b(\bar\gamma)$ and $\|\bar\beta\|_0 \le z$. In other words, $(\bar\beta,\bar\gamma)$ belongs to the right-hand set in (9).

Conversely, suppose that $(\bar\beta,\bar\gamma)$ belongs to the right-hand set in (9). By Proposition 5, it follows that $\bar\beta \in \mathrm{cl}\,[\,\text{SOL-ASC}(A,b(\bar\gamma))\,]$; so there exists $\{\beta^\nu\}$ converging to $\bar\beta$ such that $(\beta^\nu,\bar\gamma)$ belongs to SOL-xASC$(A,b,\Gamma)$ for every $\nu$. Hence $(\bar\beta,\bar\gamma)$ belongs to the left-hand closure in (9).

4 Continuous Approximation and Convergence

It is known that the $\ell_0$-function can be formulated via a complementarity condition [15]. As such, regularizations and relaxations of the latter condition [21,24,28,30] can be employed as approximations of the former function. Another approach to deal with the discrete feature of the $\ell_0$-function is branch-and-bound for mixed-integer (non)linear programming; see, e.g., [8,5]. When finite bounds on the variables are available, the indicator functions can be modeled with binary variables using such bounds, followed by a suitable branch-and-bound method for optimization. However, an effective implementation of this approach is highly nontrivial and depends on many details, e.g., the way of choosing the bounds if they are known to exist only implicitly, the tradeoff between relaxation strength and computational complexity, primal heuristics, and branching rules. In general, tools for building relatively tight lower bounds are rather limited [4] and can be computationally very expensive. In this section, we investigate the approximation of an (x)ASC system by another approach, one that has gained tremendous momentum in the field of statistical learning and sparsity representation, where the nonzero count is a principal target to be controlled in a regression/estimation model.

Inspired by the family of difference-of-convex functions designed specifically in the statistical learning literature as surrogates of the $\ell_0$-function (see [1,22] and the many references therein), we investigate the approximation of the set SOL-xASC$(A,b,\Gamma)$ obtained by replacing the univariate $\ell_0$-function by a surrogate function parameterized by a scalar that controls the tightness of the approximation. Convergence of such approximated functions to the $\ell_0$-function has been ascertained in [22] in the context of sparsity optimization, where the $\ell_0$-function is part of the objective to be optimized. In contrast, our focus here differs from the latter reference in that we analyze the convergence of the approximated sets to the solution set of the xASC. This analysis is complicated by the fact that SOL-xASC is generally not closed; so the convergence pertains to the closure of this solution set, whose characterization in Proposition 7 is key. Before introducing the approximation functions, we first summarize two key notions of set convergence; see [29] for a comprehensive study of such convergence and its connection to optimization.

For any two closed sets $C$ and $D$ in $\mathbb{R}^N$ for some integer $N > 0$, the Pompeiu-Hausdorff (PH) distance is defined as:

$$\mathrm{dist}_{\mathrm{PH}}(C,D) \;\triangleq\; \max\left\{\, \sup_{x\in C}\, \mathrm{dist}(x,D),\ \sup_{x\in D}\, \mathrm{dist}(x,C) \,\right\},$$

where the point-set distance $\mathrm{dist}(x,D)$ is by definition equal to $\inf_{y\in D} \mathrm{dist}(x,y)$, with $\mathrm{dist}(x,y) \triangleq \|x-y\|$ and $\|\bullet\|$ being a given vector norm in $\mathbb{R}^N$. Let $C(\delta)$ be a closed set in $\mathbb{R}^N$ parameterized by $\delta \in \Delta$, where $\Delta$ is a closed set in some Euclidean space. The family $\{C(\delta)\}_{\delta\in\Delta}$ is said to converge to $C(\delta_0)$ in the PH sense if $\mathrm{dist}_{\mathrm{PH}}(C(\delta),C(\delta_0)) \to 0$ as $\delta\to\delta_0 \in \Delta$. Equivalently, $C(\delta)$ converges to $C(\delta_0)$ in the PH sense if for any $\varepsilon > 0$ there is an open neighborhood $\mathcal{N}$ of $\delta_0$ such that for all $\delta \in \mathcal{N}$,

$$C(\delta) \;\subseteq\; C(\delta_0) + \mathrm{cl}\,[\,\mathbb{B}_N(0,\varepsilon)\,] \quad\text{and}\quad C(\delta_0) \;\subseteq\; C(\delta) + \mathrm{cl}\,[\,\mathbb{B}_N(0,\varepsilon)\,].$$
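
For finite point sets, the PH distance is directly computable; a tiny sketch (ours; NumPy assumed) that makes the definition concrete:

```python
import numpy as np

def dist_ph(C, D):
    """Pompeiu-Hausdorff distance between two finite point sets (one point per row)."""
    pair = np.linalg.norm(C[:, None, :] - D[None, :, :], axis=2)
    return max(pair.min(axis=1).max(),   # sup over C of dist(x, D)
               pair.min(axis=0).max())   # sup over D of dist(x, C)

C = np.array([[0.0, 0.0], [1.0, 0.0]])
D = np.array([[0.0, 0.1], [1.0, 0.0], [2.0, 0.0]])
print(dist_ph(C, D))   # 1.0: the point (2,0) of D is far from C
```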


The other notion of set convergence is that of Painlevé-Kuratowski (PK), defined as follows. Again consider the case of $C(\delta)$ as $\delta\to\delta_0$; by definition, the outer and inner limits are, respectively,

$$\limsup_{\delta\to\delta_0} C(\delta) \;\triangleq\; \left\{ x \;\middle|\; \liminf_{\delta\to\delta_0} \mathrm{dist}(x,C(\delta)) = 0 \right\} \qquad\text{and}\qquad \liminf_{\delta\to\delta_0} C(\delta) \;\triangleq\; \left\{ x \;\middle|\; \limsup_{\delta\to\delta_0} \mathrm{dist}(x,C(\delta)) = 0 \right\}.$$

In other words, the outer limit $\limsup_{\delta\to\delta_0} C(\delta)$ contains all $x$ such that there exist sequences $\delta^k \to \delta_0$ and $x^k \to x$ with $x^k \in C(\delta^k)$ for all $k$. The inner limit $\liminf_{\delta\to\delta_0} C(\delta)$ contains all $x$ such that for any sequence $\delta^k \to \delta_0$, there exists a sequence $x^k \to x$ such that $x^k \in C(\delta^k)$ for all $k$. It is easy to show that the inner limit is always a subset of the outer limit. The family $\{C(\delta)\}_{\delta\in\Delta}$ is said to converge to $C(\delta_0)$ as $\delta\to\delta_0$ in the PK sense if both the outer and inner limits are equal to $C(\delta_0)$. It is proved in [29, Proposition 5.12] that PH convergence implies PK convergence, but the converse is not true.

In later applications, we will speak of the convergence of a sequence of sets $\{C(\delta^k)\}$ to a given set $C(\delta^\infty)$ as the sequence $\{\delta^k\}$ converges to a limit $\delta^\infty$. The definition of such sequential convergence is similar to the above, which applies to all $\delta$ near the base value $\delta^\infty$.

In general, set convergence has an important role to play in the convergence of optimal solutions of an optimization problem when the constraints are being approximated, as in the case of SOL-ASC$(A,b)$ to be discussed in the next section. This is made precise in the result below; see [29, Section 7.E], where such convergence is discussed in the framework of functional epi-convergence.

Proposition 8 Let $f : \mathbb{R}^n \to \mathbb{R}$ be continuous. Let $\{\delta^k\}$ be a sequence converging to $\delta^\infty$. If

$$\lim_{k\to\infty} C(\delta^k) \;=\; C(\delta^\infty) \ \text{ in the PK sense},$$

then $\displaystyle\limsup_{k\to\infty}\ \operatorname*{argmin}_{x\in C(\delta^k)} f(x) \;\subseteq\; \operatorname*{argmin}_{x\in C(\delta^\infty)} f(x)$.

Proof. Let $\{x^k\}$ be a sequence converging to $x^\infty$ such that $x^k \in \operatorname*{argmin}_{x\in C(\delta^k)} f(x)$ for all $k$. It suffices to show two things: (i) $x^\infty \in C(\delta^\infty)$, and (ii) for every $x \in C(\delta^\infty)$, there exists $\{x^k\}$ converging to $x$ such that $x^k \in C(\delta^k)$ for all $k$. Both follow easily from the assumed convergence of $C(\delta^k)$ to $C(\delta^\infty)$.

While simple to prove and illustrative of the role of set convergence, Proposition 8 is only of conceptual importance: if the sets $C(\delta^k)$ are nonconvex and/or the objective $f(x)$ is nonconvex, then, since global minima of nonconvex optimization problems are generally not computable in practice, the convergence of a sequence of minima cannot be used to deduce any property of the limit of a sequence of non-optimal solutions of the approximated problems. Instead, the convergence of computable stationary solutions should be investigated in order to eliminate the gap between practical computation and convergence of the computed solutions. This provides the motivation for Section 6, where we focus on a kind of stationary point that can be computed by methods for dc constrained optimization problems, such as those presented in the reference [26], and investigate the convergence of such stationary solutions. Before discussing this in detail, some preparatory work is needed, beginning in the next subsection.

4.1 Approximating functions

We consider the approximation of the univariate $\ell_0$-function $|\bullet|_0$ by various surrogate functions that are motivated by the families of surrogate sparsity functions summarized in [1,22]. Specifically, writing the constraint system $A\|\beta\|_0 \le b(\gamma)$ as

$$\sum_{j=1}^n A^+_{ij}\,|\beta_j|_0 \;\le\; \sum_{j=1}^n A^-_{ij}\,|\beta_j|_0 + b_i(\gamma), \quad i = 1,\cdots,m,$$

we approximate each ASC constraint by

$$\sum_{j=1}^n A^+_{ij}\, p^+_j(\beta_j,\delta^+_j) \;\le\; \sum_{j=1}^n A^-_{ij}\, p^-_j(\beta_j,\delta^-_j) + b_i(\gamma), \quad i = 1,\cdots,m, \tag{10}$$

where $\delta^{\pm}_j$ are positive scalars and each $p^{\pm}_j$ is a continuous bivariate function $\rho : \mathbb{R}\times(0,\infty) \to [0,1]$ satisfying

(R1) $\lim_{\delta\downarrow 0} \rho(t,\delta) = |t|_0$ for any $t \in \mathbb{R}$ [this limit is not required to be uniform in $t$];
(R2) for every $\delta > 0$, $\rho(\bullet,\delta)$ is symmetric on $\mathbb{R}$ (i.e., $\rho(t,\delta) = \rho(-t,\delta)$ for all $t \ge 0$) and non-decreasing on $[0,\infty)$;
(R3) $\rho(t,\delta) = 1$ for all $t$ such that $|t| \ge \delta$.

One special feature of the approximated system (10) is that we may use different approximation functions $p^{\pm}_j(\bullet,\delta^{\pm}_j)$ corresponding to the individual entries of the matrix $A$. We let SOL-xASC$_{\delta^{\pm}}(A,b,\Gamma)$ denote the solution set of (10), emphasizing the control scalars $\delta^{\pm}_j$ of the approximating functions. We use Example 3 to show that SOL-xASC$_{\delta^{\pm}}(A,b,\Gamma)$ does not always converge to SOL-xASC$(A,b,\Gamma)$ as $\delta^{\pm} \downarrow 0$. When this happens, a limiting solution of an optimization problem subject to the approximated constraints may not even be feasible for the xASC system.

Example 3 (cont.) We approximate the 2-variable system $|\beta_1|_0 \le 0.8$ and $|\beta_2|_0 \le 2\,|\beta_1|_0$ using the capped $\ell_1$-function $\rho(t,\delta) = \min\big(\frac{|t|}{\delta},\,1\big)$ for $\delta > 0$, obtaining the approximated system consisting of the inequalities below:

$$\min\left(\frac{|\beta_1|}{\delta},\,1\right) \;\le\; 0.8 \quad\text{and}\quad \min\left(\frac{|\beta_2|}{\delta},\,1\right) \;\le\; 2\,\min\left(\frac{|\beta_1|}{\delta},\,1\right). \tag{11}$$

The solution set of the ASC, which is closed in this example, and that of the approximated system are depicted in Figure 1 below. Algebraically, the solution set of (11), for fixed $\delta > 0$, is the union of two non-convex sets:

$$\big\{ (\beta_1,\beta_2) \mid |\beta_1| \le 0.8\,\delta,\ |\beta_2| \le 2\,|\beta_1| \big\} \ \cup\ \underbrace{\big\{ (\beta_1,\beta_2) \mid 0.5\,\delta \le |\beta_1| \le 0.8\,\delta \big\}}_{\text{the vertical stripes in the right-hand figure}}.$$

Since the two vertical stripes are always contained in the above union for any $\delta > 0$ and shrink to the vertical $\beta_2$-axis as $\delta \downarrow 0$, it is not difficult to see that this vertical axis is the limit of SOL-ASC$_\delta$ as $\delta \downarrow 0$. Clearly, this axis is a proper superset of the solution set of the ASC system, which is the singleton $\{(0,0)\}$.

Fig. 1: (a) $\mathcal{Z}^{\le} = \{z \in \{0,1\}^2 \mid Az \le b\} = \{(0,0)\}$; (b) the solution set of the approximated system (11).
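
A quick numerical illustration of this failure of convergence (our sketch): the points $(0.5\,\delta,\, t)$ satisfy (11) for every $\delta > 0$, so every point $(0,t)$ of the $\beta_2$-axis is a limit of approximated-feasible points, although only $(0,0)$ solves the ASC.

```python
def feasible_11(beta, delta):
    """Feasibility of the approximated system (11) with the capped l1 surrogate."""
    r = lambda t: min(abs(t) / delta, 1.0)
    return r(beta[0]) <= 0.8 and r(beta[1]) <= 2.0 * r(beta[0])

t = 1.0                                   # target limit point (0, 1) on the beta_2-axis
for delta in [1.0, 0.1, 0.01, 0.001]:
    print(delta, feasible_11((0.5 * delta, t), delta))   # True for every delta
```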

We present below various surrogate functions summarized in [1] and briefly discuss their satisfaction of the assumptions (R1)-(R3).

The SCAD family. Parameterized by two scalars $a > 2$ and $\lambda > 0$ and with the origin as its unique zero, the univariate smoothly clipped absolute deviation (SCAD) function is once continuously differentiable except at the origin and given by: for all $t \in \mathbb{R}$,

$$p^{\mathrm{SCAD}}_{a,\lambda}(t) \;\triangleq\; \begin{cases} \lambda\,|t| & \text{if } |t| \le \lambda \\[4pt] \dfrac{(a+1)\lambda^2}{2} - \dfrac{(a\lambda - |t|)^2}{2(a-1)} & \text{if } \lambda \le |t| \le a\lambda \\[4pt] \dfrac{(a+1)\lambda^2}{2} & \text{if } |t| \ge a\lambda. \end{cases}$$

To conform to the stated assumptions, we scale the function by the reciprocal of $\dfrac{(a+1)\lambda^2}{2}$ and identify $\delta = a\lambda$ with $a$ being fixed, obtaining

$$\rho^{\mathrm{SCAD}}_a(t,\delta) \;=\; \begin{cases} \dfrac{2a}{(a+1)\,\delta}\,|t| & \text{if } |t| \le \dfrac{\delta}{a} \\[4pt] 1 - \dfrac{(\delta - |t|)^2}{\left(1-\frac{1}{a^2}\right)\delta^2} & \text{if } \dfrac{\delta}{a} \le |t| \le \delta \\[4pt] 1 & \text{if } |t| \ge \delta. \end{cases}$$

Note that while $\lim_{\delta\downarrow 0} \rho^{\mathrm{SCAD}}_a(t,\delta) = |t|_0$ for any $t \in \mathbb{R}$, this limit is not uniform in $t$ near 0. The same remark applies to the following families of functions.

The MCP family. Also parameterized by two positive scalars $a > 2$ and $\lambda$, this is a univariate, piecewise quadratic function given by:

$$p^{\mathrm{MCP}}_{a,\lambda}(t) \;\triangleq\; \frac{1}{2}\left\{ a\lambda^2 - \frac{\big[\,(a\lambda - |t|)_+\,\big]^2}{a} \right\} \quad\text{for } t \in \mathbb{R}.$$

Again, to conform with the approximating conditions, we scale the function by $\dfrac{2}{a\lambda^2}$ and identify $\delta = a\lambda$ with $a$ being fixed, obtaining

$$\rho^{\mathrm{MCP}}(t,\delta) \;=\; \begin{cases} \dfrac{2\,|t|}{\delta} - \dfrac{t^2}{\delta^2} & \text{if } |t| \le \delta \\[4pt] 1 & \text{if } |t| \ge \delta. \end{cases}$$

The capped $\ell_1$ family. This has already been mentioned before; namely, for $\delta > 0$,

$$\rho^{\mathrm{C}\ell_1}(t,\delta) \;=\; \min\left( \frac{|t|}{\delta},\ 1 \right), \quad\text{for } t \in \mathbb{R}.$$

The truncated transformed $\ell_1$ family. Parameterized by a given scalar $a > 0$, this is a truncated modification of the transformed $\ell_1$-function in [1], taking into account the scaling factor $\delta > 0$ that tends to zero:

$$\rho^{\mathrm{TTL1}}_a(t,\delta) \;=\; \min\left( \frac{(a+\delta)\,|t|}{\delta\,(a+|t|)},\ 1 \right), \quad\text{for } t \in \mathbb{R}.$$

It is easy to see that properties (R1) and (R2) hold for this function $\rho^{\mathrm{TTL1}}_a(t,\delta)$ for $(t,\delta) \in \mathbb{R}\times(0,\infty)$. Noting that

$$\frac{(a+\delta)\,|t|}{\delta\,(a+|t|)} \;=\; 1 + \frac{a\,(|t| - \delta)}{\delta\,(a+|t|)},$$

we may conclude that condition (R3) is also satisfied.

The truncated logarithmic family. Derived similarly to the truncated transformed $\ell_1$-functions and parameterized by a scalar $\varepsilon > 0$, a truncated logarithmic penalty function is defined as follows: for $\delta > 0$,

$$\rho^{\mathrm{Tlog}}_\varepsilon(t,\delta) \;\triangleq\; \min\left( \frac{1}{\log\left(1+\frac{\delta}{\varepsilon}\right)}\,\big[\log(|t|+\varepsilon) - \log(\varepsilon)\big],\ 1 \right), \quad\text{for } t \in \mathbb{R}.$$

It is not difficult to see that this function satisfies the desired conditions (R1), (R2), and (R3).

In summary, we have identified a number of well-known univariate surrogate sparsity functions, suitably truncated outside a $\delta$-neighborhood of the origin, that satisfy the basic requirements of an approximating function of the $\ell_0$-function.
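
The following sketch (ours; NumPy assumed; the parameters $a$ and $\varepsilon$ are fixed to arbitrary admissible values) implements the five scaled surrogates and spot-checks (R1) and (R3) numerically:

```python
import numpy as np

def rho_scad(t, d, a=3.0):
    t = abs(t)
    if t <= d / a:  return 2 * a * t / ((a + 1) * d)
    if t <= d:      return 1 - (d - t) ** 2 / ((1 - 1 / a ** 2) * d ** 2)
    return 1.0

def rho_mcp(t, d):
    t = abs(t)
    return 2 * t / d - t ** 2 / d ** 2 if t <= d else 1.0

def rho_cl1(t, d):              # capped l1
    return min(abs(t) / d, 1.0)

def rho_ttl1(t, d, a=1.0):      # truncated transformed l1
    return min((a + d) * abs(t) / (d * (a + abs(t))), 1.0)

def rho_tlog(t, d, eps=0.1):    # truncated logarithmic
    return min((np.log(abs(t) + eps) - np.log(eps)) / np.log(1 + d / eps), 1.0)

for rho in (rho_scad, rho_mcp, rho_cl1, rho_ttl1, rho_tlog):
    # (R3): value 1 once |t| >= delta; value 0 at the origin.
    assert abs(rho(0.5, 0.5) - 1.0) < 1e-12 and rho(0.0, 0.5) == 0.0
    # (R1): pointwise limit |t|_0 as delta -> 0 (checked at t = 0.1).
    assert abs(rho(0.1, 1e-9) - 1.0) < 1e-9
    print(rho.__name__, "ok")
```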


4.2 Convergence

The main convergence result for the family of approximated solution sets $\{\text{SOL-xASC}_{\delta^\pm}(A,b,\Gamma)\}_{\delta^\pm>0}$ as $\delta^\pm \to 0$ is the following.

Theorem 1 Suppose that for every $\gamma \in \Gamma$, Assumption A holds for the pair $(A,b(\gamma))$. Assume further that for all $\delta^\pm_j > 0$ sufficiently small, the surrogate functions in the family $\{p^\pm_j(\bullet,\delta^\pm_j)\}_{j=1}^n$ satisfy, in addition to (R1), (R2), and (R3), either

$$\sum_{j=1}^n A^-_{ij}\, p^-_j(\beta_j,\delta^-_j) \;\le\; \sum_{j=1}^n A^-_{ij}\, p^+_j(\beta_j,\delta^+_j), \quad \forall\,\beta\in\mathbb{R}^n \text{ and } i = 1,\cdots,m, \tag{12}$$

or

$$\sum_{j=1}^n A^+_{ij}\, p^-_j(\beta_j,\delta^-_j) \;\le\; \sum_{j=1}^n A^+_{ij}\, p^+_j(\beta_j,\delta^+_j), \quad \forall\,\beta\in\mathbb{R}^n \text{ and } i = 1,\cdots,m. \tag{13}$$

Then the family $\{\text{SOL-xASC}_{\delta^\pm}(A,b,\Gamma)\}_{\delta^\pm>0}$ converges to $\mathrm{cl}\,[\,\text{SOL-xASC}(A,b,\Gamma)\,]$ in the PH sense as $\|\delta^\pm\|_\infty \triangleq \max\{\delta^\pm_j \mid j = 1,\cdots,n\} \downarrow 0$.

Proof Let $C(\delta^\pm) \triangleq \text{SOL-xASC}_{\delta^\pm}(A,b,\Gamma)$ and $C_0 \triangleq \mathrm{cl}\,[\,\text{SOL-xASC}(A,b,\Gamma)\,]$. We need to show that for every $\varepsilon > 0$, a scalar $\bar\delta > 0$ exists such that for all $\delta^\pm$ with $\|\delta^\pm\|_\infty \in (0,\bar\delta\,]$,

$$C(\delta^\pm) \;\subseteq\; C_0 + \mathrm{cl}\,[\,\mathbb{B}_{n+k}(0,\varepsilon)\,] \quad\text{and}\quad C_0 \;\subseteq\; C(\delta^\pm) + \mathrm{cl}\,[\,\mathbb{B}_{n+k}(0,\varepsilon)\,]. \tag{14}$$

By Proposition 7, we have

$$C_0 \;=\; \{\, (\beta,\gamma) \in \mathbb{R}^n\times\Gamma \mid \exists\, z \in [0,1]^n \text{ such that } Az \le b(\gamma) \text{ and } \|\beta\|_0 \le z \,\}.$$

Let $(\bar\beta,\bar\gamma) \in C(\delta^\pm)$. Then $\bar\gamma \in \Gamma$. For the rest of the proof, we assume that (12) holds; a similar proof can be applied when (13) holds. For all $i = 1,\cdots,m$,

$$\sum_{j=1}^n A^+_{ij}\, p^+_j(\bar\beta_j,\delta^+_j) \;\le\; \sum_{j=1}^n A^-_{ij}\, p^-_j(\bar\beta_j,\delta^-_j) + b_i(\bar\gamma) \;\le\; \sum_{j=1}^n A^-_{ij}\, p^+_j(\bar\beta_j,\delta^+_j) + b_i(\bar\gamma);$$

in particular, the vector $\big(p^+_j(\bar\beta_j,\delta^+_j)\big)_{j=1}^n \in [0,1]^n$ satisfies the system $Az \le b(\bar\gamma)$. By Assumption A, there exists $z \in \{0,1\}^n$ such that $Az \le b(\bar\gamma)$ and $z_j = 1$ whenever $p^+_j(\bar\beta_j,\delta^+_j) = 1$. Define a vector $\beta(\delta)$ with components given by

$$\beta_j(\delta) \;\triangleq\; \begin{cases} \bar\beta_j & \text{if } |\bar\beta_j| \ge \delta^+_j \ \big(\text{thus } p^+_j(\bar\beta_j,\delta^+_j) = 1\big) \\[4pt] \mathrm{sign}(\bar\beta_j)\,\delta^+_j\, z_j & \text{if } |\bar\beta_j| < \delta^+_j, \end{cases}$$

where we define $\mathrm{sign}(0) = 1$. It is easily seen that $\|\beta(\delta)\|_0 = z$. Hence $A\|\beta(\delta)\|_0 \le b(\bar\gamma)$; so $(\beta(\delta),\bar\gamma) \in C_0$. For an index $j$ such that $|\bar\beta_j| < \delta^+_j$, we have

$$\beta_j(\delta) - \bar\beta_j \;=\; \mathrm{sign}(\bar\beta_j)\,\big[\,\delta^+_j z_j - |\bar\beta_j|\,\big],$$

implying that $|\beta_j(\delta) - \bar\beta_j| = \big|\,\delta^+_j z_j - |\bar\beta_j|\,\big| \le \delta^+_j$. Hence the first inclusion in (14) holds.

To prove the second inclusion, note first that since the set $C(\delta^\pm) + \mathrm{cl}\,[\,\mathbb{B}_{n+k}(0,\varepsilon)\,]$ is closed, it suffices to establish the inclusion for an arbitrary $(\bar\beta,\bar\gamma) \in \text{SOL-xASC}(A,b,\Gamma)$. Define a vector $\beta(\delta)$ with components given by:

$$\beta_j(\delta) \;\triangleq\; \begin{cases} \bar\beta_j & \text{if } |\bar\beta_j| \ge \delta_j \text{ or } \bar\beta_j = 0 \\[4pt] \mathrm{sign}(\bar\beta_j)\,\delta_j & \text{otherwise,} \end{cases}$$

where $\delta_j \triangleq \max\big(\delta^+_j,\delta^-_j\big)$. It then follows that $p^\pm_j(\beta_j(\delta),\delta^\pm_j) = 1 = |\bar\beta_j|_0$ unless $\bar\beta_j = 0$, in which case $p^\pm_j(\beta_j(\delta),\delta^\pm_j) = 0 = |\bar\beta_j|_0$. Since $A\|\bar\beta\|_0 \le b(\bar\gamma)$, it follows that $(\beta(\delta),\bar\gamma) \in C(\delta^\pm)$. Since clearly $|\beta_j(\delta) - \bar\beta_j| \le \delta_j$, the second inclusion in (14) also holds.


Two special cases worth noting are when A ≥ 0 or A ≤ 0. It is easy to verify that Assumption A holds in both cases (see the discussion after Proposition 5); moreover, (12) holds in the former case and (13) holds in the latter case. Hence Theorem 1 is valid under the basic properties (R1), (R2), and (R3). In general, the two requirements (12) and (13) can be enforced by choosing p⁺_j and p⁻_j such that p⁻_j(•, δ⁻_j) ≤ p⁺_j(•, δ⁺_j) pointwise.
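The rounding construction in the proof of Theorem 1 can be illustrated numerically. The following Python sketch uses our own toy instance (hypothetical data, not from the paper) with a nonnegative A, so that Assumption A holds and (12) is automatic, and capped ℓ1 surrogates with a common δ.

```python
# Sketch of the proof's rounding step: a point feasible for the approximated
# system is moved to a point of the original ASC system within distance delta.
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])   # nonnegative A, so Assumption A holds
b = np.array([1.0, 2.0])
delta = 0.05

def p_capped_l1(t, d):
    return np.minimum(np.abs(t) / d, 1.0)

beta = np.array([0.02, 0.0, 0.9])   # surrogate values [0.4, 0, 1]; A @ p <= b
assert np.all(A @ p_capped_l1(beta, delta) <= b + 1e-12)

# A 0/1 vector z as in Assumption A: for this toy instance, rounding the
# surrogate values up exactly where they equal 1 already satisfies Az <= b.
z = (p_capped_l1(beta, delta) >= 1.0).astype(float)
assert np.all(A @ z <= b + 1e-12)

# The rounded point beta(delta) from the proof, with sign(0) := 1.
sign = np.where(beta >= 0, 1.0, -1.0)
beta_delta = np.where(np.abs(beta) >= delta, beta, sign * delta * z)
assert np.all(A @ (beta_delta != 0).astype(float) <= b + 1e-12)  # in SOL-ASC(A,b)
assert np.max(np.abs(beta_delta - beta)) <= delta + 1e-12        # within delta
print("rounded point:", beta_delta)
```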

Our next convergence result pertains to the situation where we fix δ⁻_j > 0 for all j = 1,··· ,n and consider only a one-sided approximation |β_j|_0 ≈ p⁺_j(β_j, δ⁺_j) with δ⁺_j ↓ 0. Specifically, we consider the approximation of the system

∑_{j=1}^n A⁺_{ij} |β_j|_0 ≤ b_i(β,γ) ≜ ∑_{j=1}^n A⁻_{ij} p⁻_j(β_j, δ⁻_j) + b_i(γ), ∀ i = 1,··· ,m,

where we fix δ⁻_j and approximate |•|_0 on the left side as said. In this case, convergence in the PK sense can be established without Assumption A and with no restriction on the choice of the approximating functions p⁺_j(•, δ⁺_j) except for the basic properties (R1), (R2), and (R3). Recognizing that the above system is a constraint system A‖β‖_0 ≤ b(β,γ), with A being a nonnegative matrix and b being a continuous function of the pair (β,γ) whose dependence on {δ⁻_j}_{j=1}^n we have suppressed, we state and prove a version of Theorem 1 for such a system.

Proposition 9 Let A be a nonnegative matrix, b(β,γ) be a continuous function, and Γ be a closed set. For each j = 1,··· ,n, let p⁺_j(•, δ⁺_j) be an approximating function satisfying conditions (R1), (R2), and (R3). Define, for any δ⁺ ≜ (δ⁺_j)_{j=1}^n > 0, the sets

C(δ⁺) ≜ { (β,γ) ∈ ℝⁿ × Γ | ∑_{j=1}^n A_{•j} p⁺_j(β_j, δ⁺_j) ≤ b(β,γ) } and C0 ≜ { (β,γ) ∈ ℝⁿ × Γ | A‖β‖_0 ≤ b(β,γ) }.

Then the family {C(δ⁺)}_{δ⁺ > 0} converges to C0 in the PK sense as ‖δ⁺‖_∞ ≜ max{ δ⁺_j | j = 1,··· ,n } ↓ 0.

Proof Since C0 is a subset of C(δ⁺) for any δ⁺ > 0, it follows that the former set is a subset of liminf_{δ⁺↓0} C(δ⁺). It remains to show that limsup_{δ⁺↓0} C(δ⁺) ⊆ C0. Let (β^∞, γ^∞) be the limit of a sequence {(β^k, γ^k)}, where for each k the pair (β^k, γ^k) satisfies

∑_{j=1}^n A_{ij} p⁺_j(β^k_j, δ^{+,k}_j) ≤ b_i(β^k, γ^k), ∀ i = 1,··· ,m,

corresponding to a sequence of positive scalars δ^{+,k} ↓ 0. By the nonnegativity of A, property (R3), and the nonnegativity of the approximating functions p⁺_j(•, δ⁺_j), we have for each i = 1,··· ,m and all k sufficiently large,

∑_{j=1}^n A_{ij} |β^∞_j|_0 = ∑_{j ∈ supp(β^∞)} A_{ij} ≤ ∑_{j=1}^n A_{ij} p⁺_j(β^k_j, δ^{+,k}_j) ≤ b_i(β^k, γ^k).

Passing to the limit k → ∞ establishes the desired inclusion.
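The following small grid experiment (our own construction, not from the paper) illustrates the set convergence asserted in Proposition 9 for the simple system |β1|_0 + |β2|_0 ≤ 1 approximated by capped ℓ1 surrogates; the grid resolution and the δ values are arbitrary.

```python
# Grid check: the worst-case distance from C(delta) to C0 shrinks with delta.
import numpy as np

grid = np.linspace(-1.0, 1.0, 401)
B1, B2 = np.meshgrid(grid, grid)

def dist_to_C0(b1, b2):
    # C0 = {at most one nonzero coordinate}: the distance to the union of the
    # two axes is the smaller coordinate magnitude.
    return np.minimum(np.abs(b1), np.abs(b2))

for delta in (0.5, 0.1, 0.02):
    p = lambda t: np.minimum(np.abs(t) / delta, 1.0)   # capped l1 surrogate
    in_C_delta = p(B1) + p(B2) <= 1.0
    worst = dist_to_C0(B1[in_C_delta], B2[in_C_delta]).max()
    print(f"delta = {delta:5.2f}: max distance from C(delta) to C0 = {worst:.4f}")
# Expect roughly delta/2: the farthest feasible points satisfy |b1|+|b2| = delta.
```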

5 Tangent Properties of SOL-ASC and its Approximation

For a given closed set C ⊆ ℝ^N, the tangent cone of C at a vector x ∈ C, denoted T(x;C), consists of the vectors v such that there exist a sequence of vectors {x^k} ⊆ C converging to x and positive scalars {τ_k} converging to zero such that v = lim_{k→∞} (x^k − x)/τ_k. Tangent vectors of closed sets play an important role in the stationarity conditions of optimization problems constrained by such sets. In this section, we characterize the tangent vectors of SOL-ASC(A,b) and those of its approximation SOL-ASC_{δ±}(A,b). We omit the extension (A,b,Γ) in order to focus on the ‖•‖_0-function and its componentwise approximation by the penalty functions p±_j(•, δ±_j). Recalling that a tangent cone must be a closed set, we have the following expression of the tangent cone of the SOL-ASC, which shows in particular that the latter cone is also defined by an ASC. We also obtain a superset of the tangent cone in the case where the matrix A is nonnegative, which will be useful subsequently. For a given β̄ ∈ SOL-ASC(A,b), we let A_ASC(β̄) be the set of indices i ∈ {1,··· ,m} such that A_{i•}‖β̄‖_0 = b_i.

Proposition 10 Let β̄ ∈ SOL-ASC(A,b). Let S^c be the complement of S ≜ supp(β̄) in {1,··· ,n}. It holds that

T(β̄; SOL-ASC(A,b)) = cl[ { v | A_{•S^c} ‖v_{S^c}‖_0 ≤ b − A‖β̄‖_0 } ] = cl[ { v | (β̄_S, v_{S^c}) ∈ SOL-ASC(A,b) } ]. (15)

Thus if A ≥ 0, then

T(β̄; SOL-ASC(A,b)) ⊆ { v | v_j = 0 for all j ∈ S^c such that ∃ i ∈ A_ASC(β̄) with A_{ij} > 0 }.

Proof The equality of the two closures is easy to see. To prove the equality of the tangent cone and the first closure, write C ≜ SOL-ASC(A,b). We first show that

{ v | A_{•S^c} ‖v_{S^c}‖_0 ≤ b − A‖β̄‖_0 } ⊆ T(β̄;C).

This is enough to imply that the left-hand tangent cone in (15) contains the right-hand closures in the same expression. Let v be an arbitrary vector satisfying A_{•S^c} ‖v_{S^c}‖_0 ≤ b − A‖β̄‖_0. For all τ > 0 sufficiently small, we have ‖β̄ + τv‖_0 = max( ‖β̄‖_0, ‖v‖_0 ). So

A‖β̄ + τv‖_0 = A‖β̄‖_0 + A_{•S^c} ‖v_{S^c}‖_0 ≤ b,

which implies that v is a tangent vector of C at β̄. Conversely, let d ∈ T(β̄;C) be given. Let {β^k} ⊆ C and τ_k ↓ 0 be such that

lim_{k→∞} β^k = β̄ and lim_{k→∞} (β^k − β̄)/τ_k = d.

Clearly, ‖(β^k_{S^c} − β̄_{S^c})/τ_k‖_0 = ‖β^k_{S^c}‖_0; moreover, for all k sufficiently large,

b ≥ A‖β^k‖_0 = A‖β̄‖_0 + A_{•S^c} ‖β^k_{S^c}‖_0 = A‖β̄‖_0 + A_{•S^c} ‖(β^k_{S^c} − β̄_{S^c})/τ_k‖_0.

Hence it follows readily that d belongs to the right-hand closure in (15).

To prove the last assertion of the proposition, let A ≥ 0 and let v satisfy A_{•S^c} ‖v_{S^c}‖_0 ≤ b − A‖β̄‖_0. For every i ∈ A_ASC(β̄), we have

∑_{j ∉ S and v_j ≠ 0} A_{ij} ≤ 0.

Therefore, if j ∈ S^c and there is an i ∈ A_ASC(β̄) with A_{ij} > 0, then we must have v_j = 0.

Proposition 10 yields two interesting properties of T(β̄; SOL-ASC(A,b)) that can be contrasted with the tangent cone of the polyhedron P(A,b) ≜ { w ∈ ℝⁿ | Aw ≤ b }. First, it is known that for a given w̄ ∈ P(A,b), we have

T(w̄; P(A,b)) = { v ∈ ℝⁿ | A_{i•} v ≤ 0, ∀ i ∈ A(w̄) },

where A(w̄) is the index set of the active constraints at w̄ ∈ P(A,b). It follows from this representation of T(w̄; P(A,b)) that v is a tangent vector of P(A,b) at w̄ if and only if w̄ + τv ∈ P(A,b) for all τ > 0 sufficiently small. In contrast, Proposition 10 shows that a vector v is a tangent of the set SOL-ASC(A,b) at β̄ if and only if it is the limit of a sequence {v^k} for which a scalar τ̄ > 0 and an integer k̄ > 0 exist such that for all τ ∈ (0, τ̄] and all k ≥ k̄, β̄ + τv^k ∈ SOL-ASC(A,b). From this, it follows that if v is a tangent of the set SOL-ASC(A,b), then β̄ + τv ∈ cl[SOL-ASC(A,b)] for all τ > 0 sufficiently small. Thus the tangents of SOL-ASC(A,b) have exactly the same feasibility property as those of P(A,b), provided that SOL-ASC(A,b) is closed. Another interesting consequence of Proposition 10 is that the entire set of constraints A‖β‖_0 ≤ b and the (in)activity of the sparsity constraints |β_j| ≥ 0, j = 1,··· ,n, are all involved in the definition of the tangent vectors of SOL-ASC(A,b). In contrast, T(w̄; P(A,b)) involves only the constraints of the system Aw ≤ b that are active at w̄.


Example 4 We illustrate Proposition 10 for a pair of upper and lower cardinality constraints: for two positive integers k̲ < k̄,

k̲ ≤ ∑_{j=1}^n |β_j|_0 ≤ k̄. (16)

Let C be the solution set of this ASC, and let β̄ be a given vector satisfying (16). There are two cases:

• ∑_{j=1}^n |β̄_j|_0 = k̄. In this case, it is not difficult to show that T(β̄;C) = { v | v_j = 0 for all j ∈ S^c }.

• ∑_{j=1}^n |β̄_j|_0 < k̄. In this case, it is not difficult to show that T(β̄;C) = { v | ∑_{j ∉ S} |v_j|_0 ≤ k̄ − ∑_{j=1}^n |β̄_j|_0 }.

The noteworthy point of this illustrative example is that the lower cardinality constraint does not enter into the representation of the tangent cone, regardless of whether this constraint is binding or not. More generally, for the ASC (A,b), it is easy to see that if a row A_{i•} contains no positive entries, then the constraint A_{iS^c} ‖v_{S^c}‖_0 ≤ b_i − A_{i•}‖β̄‖_0 is trivially satisfied, hence redundant, in the representation of the tangent cone of the solution set of the ASC at β̄.
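The case analysis of Example 4 translates directly into a membership test for the tangent cone. The following Python sketch (our own helper with hypothetical data, not from the paper) implements it; note that the lower bound k̲ plays no role, as observed above.

```python
# Tangent-cone membership test for k_lo <= ||beta||_0 <= k_up (Example 4).
import numpy as np

def in_tangent_cone(v, beta, k_up):
    S = np.flatnonzero(beta)                         # support of beta
    Sc = np.setdiff1d(np.arange(len(beta)), S)       # complement of the support
    card = len(S)                                    # current cardinality
    if card == k_up:                                 # upper bound binding
        return np.all(v[Sc] == 0)
    # upper bound slack: at most (k_up - card) new nonzeros are allowed
    return np.count_nonzero(v[Sc]) <= k_up - card

beta = np.array([2.0, -1.0, 0.0, 0.0])   # ||beta||_0 = 2
print(in_tangent_cone(np.array([5.0, 1.0, 0.3, 0.0]), beta, k_up=3))  # True
print(in_tangent_cone(np.array([0.0, 0.0, 0.3, 0.7]), beta, k_up=3))  # False
print(in_tangent_cone(np.array([1.0, 1.0, 0.0, 0.0]), beta, k_up=2))  # True (binding)
```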

5.1 Tangent cone of approximated sets: Fixed δ

Consider the approximated SOL-ASC_{δ±}(A,b) with each approximating function p±_j(•, δ±_j) given by: for some positive integer K_j,

p±_j(t, δ±_j) ≜ λ±_j |t| − g±_j(t), where g±_j(t) ≜ max_{1≤k≤K_j} g±_{jk}(t), t ∈ ℝ, (17)

where each λ±_j is a positive scalar and each g±_{jk} is a univariate differentiable convex function, all dependent on {δ±_j}_{j=1}^n. As proved in [1], the surrogate sparsity functions in the SCAD, MCP, capped ℓ1, transformed ℓ1, and logarithmic families can all be expressed in the above difference-of-convex (dc) form. From the expression

min( λ|t| − g(t), 1 ) = λ|t| − max( g(t), λ|t| − 1 ),

we see that the truncation of a dc function of the above form can be represented in the same form; thus dc functions given by (17) also include those in the truncated transformed ℓ1 and truncated logarithmic families. Hence all the functions discussed in Subsection 4.1 are covered by the form (17).
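The truncation identity displayed above is elementary but easy to verify numerically; the following sketch (our own check, not from the paper) confirms it on a grid for the capped ℓ1 instance, where g ≡ 0 and λ = 1/δ.

```python
# Numerical verification of min(lam*|t| - g(t), 1) = lam*|t| - max(g(t), lam*|t| - 1).
import numpy as np

delta, lam = 0.3, 1.0 / 0.3
g = lambda t: 0.0 * t                      # capped l1: g identically zero
for t in np.linspace(-2, 2, 1001):
    lhs = min(lam * abs(t) - g(t), 1.0)
    rhs = lam * abs(t) - max(g(t), lam * abs(t) - 1.0)
    assert abs(lhs - rhs) < 1e-12
print("truncation identity holds on the test grid")
```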

With p±_j(•, δ±_j) as given, the inequality

∑_{j=1}^n A⁺_{ij} p⁺_j(β_j, δ⁺_j) ≤ ∑_{j=1}^n A⁻_{ij} p⁻_j(β_j, δ⁻_j) + b_i

can be written very simply as

ζ_i(β) − b_i ≤ 0, where ζ_i(β) ≜ φ_i(β) − ψ_i(β), (18)

with

φ_i(β) ≜ ∑_{j=1}^n [ A⁺_{ij} λ⁺_j |β_j| + A⁻_{ij} g⁻_j(β_j) ] and ψ_i(β) ≜ ∑_{j=1}^n [ A⁻_{ij} λ⁻_j |β_j| + A⁺_{ij} g⁺_j(β_j) ]

being convex functions. Thus SOL-ASC_{δ±}(A,b) is a "dc set"; i.e., it has the representation:

SOL-ASC_{δ±}(A,b) = { β | ζ_i(β) ≤ b_i, ∀ i = 1,··· ,m }.

We recall that the directional derivative of a function ζ at a given vector β̄ in the direction v is given by

ζ′(β̄; v) ≜ lim_{τ↓0} [ ζ(β̄ + τv) − ζ(β̄) ] / τ,

if the limit exists. The following representation of the tangent cone of SOL-ASC_{δ±}(A,b) is directly adopted from Corollary 1 and Proposition 3 in [26], where a proof can be found.


Proposition 11 Let β̄ ∈ SOL-ASC_{δ±}(A,b) be given. Let each p±_j(•, δ±_j) be given by (17). Let

A_{δ±}(β̄) ≜ { i | ∑_{j=1}^n A⁺_{ij} p⁺_j(β̄_j, δ⁺_j) = ∑_{j=1}^n A⁻_{ij} p⁻_j(β̄_j, δ⁻_j) + b_i }.

If each function g⁻_{jk} is linear for all j = 1,··· ,n and k = 1,··· ,K_j, it holds that

T(β̄; SOL-ASC_{δ±}(A,b)) = { v | ζ′_i(β̄; v) ≤ 0, ∀ i ∈ A_{δ±}(β̄) }; (19)

hence T(β̄; SOL-ASC_{δ±}(A,b)) is the union of finitely many closed convex cones.

The linearity of the functions g⁻_{jk} implies that each approximating function p⁻_j(•, δ⁻_j) is piecewise linear, although no such piecewise linearity is required of p⁺_j(•, δ⁺_j). Among the five families (SCAD, MCP, capped ℓ1, truncated transformed ℓ1, and truncated logarithmic), only the capped ℓ1 function is piecewise linear; the SCAD and MCP functions are differentiable on the real line except at the origin; the latter two truncated functions have two additional non-differentiable points at t = ±δ.

In terms of the functions p±_j(•, δ±_j), the expression (19) yields the following:

T(β̄; SOL-ASC_{δ±}(A,b)) = { v | ∑_{j=1}^n A⁺_{ij} p⁺_j(•, δ⁺_j)′(β̄_j; v_j) ≤ ∑_{j=1}^n A⁻_{ij} p⁻_j(•, δ⁻_j)′(β̄_j; v_j), ∀ i ∈ A_{δ±}(β̄) }. (20)

Unlike the tangent cone of SOL-ASC (cf. (15)), no closure is needed in the above right-hand set because this set is already closed. We provide an example showing that the equality (19) can fail without the linearity assumption on the functions g⁻_{jk} in Proposition 11.

Example 1 (continued) Consider the system |β1|_0 ≤ |β2|_0 at the feasible point β̄ = (2,2). We approximate both ℓ0 functions by the MCP functions with a fixed δ = 2 as follows: for i = 1, 2,

ρ_i(β_i) = |β_i| − g_i(β_i), where g_i(t) ≜ t²/4 if |t| ≤ 2, and |t| − 1 if |t| ≥ 2.

It is easy to check that g_i is convex and continuously differentiable on ℝ. Moreover, ρ_i(β_i) is differentiable everywhere except at β_i = 0. The dc set is

{ (β1,β2) ∈ ℝ² | ρ1(β1) ≤ ρ2(β2) } = { (β1,β2) ∈ ℝ² | |β1| ≤ |β2| or |β2| ≥ 2 }. (21)

Since ρ′_i(2) = 0, it follows that

{ (v1,v2) ∈ ℝ² | ρ′1(2) v1 ≤ ρ′2(2) v2 } = ℝ². (22)

In contrast, the actual tangent cone of the set (21) at the given β̄ = (2,2) is { (d1,d2) ∈ ℝ² | d1 ≤ d2 or d2 ≥ 0 }, which is clearly a proper subset of (22).
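The gap between (22) and the actual tangent cone can be confirmed numerically; the following sketch (our own check) shows that along the direction d = (1, −1), which trivially satisfies (22), every small perturbation of β̄ = (2,2) leaves the dc set (21), so d is not a tangent direction there.

```python
# Numerical check of Example 1 (continued): d = (1, -1) satisfies (22) but is
# not a tangent of the dc set (21) at beta = (2, 2).
def g(t):
    return t * t / 4.0 if abs(t) <= 2.0 else abs(t) - 1.0

def rho(t):
    return abs(t) - g(t)          # fixed-delta MCP surrogate with delta = 2

def in_dc_set(b1, b2):
    return rho(b1) <= rho(b2) + 1e-12

beta, d = (2.0, 2.0), (1.0, -1.0)
for tau in (0.5, 0.1, 0.01, 0.001):
    b1, b2 = beta[0] + tau * d[0], beta[1] + tau * d[1]
    print(f"tau = {tau:6.3f}: rho(b1) = {rho(b1):.6f}, rho(b2) = {rho(b2):.6f}, "
          f"feasible = {in_dc_set(b1, b2)}")
# rho(b1) stays at the truncation value 1 while rho(b2) < 1, so every perturbed
# point is infeasible, confirming the strict inclusion asserted above.
```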

6 Convergence of B-Stationary Solutions

In this section, we apply the results derived in the above sections to address the convergence of stationary solutions. Consider the following optimization problem:

minimize_β f(β) ≜ h(β) − g(β) subject to β ∈ C ≜ cl[SOL-ASC(A,b)], (23)

where h is a convex function (not necessarily differentiable) and g is a continuously differentiable convex function, both defined on an open set Ω containing the feasible set C. Thus f is a difference-of-convex (dc) function. The non-convexity of cl[SOL-ASC(A,b)] adds complications to the problem. As a non-convex optimization problem, we cannot realistically hope to be able to compute a local minimizer easily, let alone a global minimizer. At best, a stationary solution of some kind is what an iterative algorithm can approximately compute. In general, stationarity is a necessary condition for the local minimizing property. For a non-convex non-differentiable optimization problem, there are many kinds of stationary solutions. When the problem is constrained by a non-convex set, one needs to consider a constrained notion of stationarity. As detailed in [26], to deal with a dc optimization problem with a non-convex non-differentiable feasible set, the sharpest kind of stationary solution is that based on the elementary directional derivatives of the objective function (which are well defined for a dc function) and the tangent cone of the constraint set, sharpest in the sense that the resulting stationary solution must satisfy all other definitions of stationarity. The cited reference contains more details about such a stationary point, which is termed a B(ouligand) stationary solution, and a discussion of its computation.

By definition, a feasible vector β̄ ∈ C is a B(ouligand) stationary solution of the problem (23) if

h′(β̄; v) − ∇g(β̄)ᵀv = f′(β̄; v) ≥ 0, ∀ v ∈ T(β̄; C).

We attempt to approximate such a stationary solution β̄ by solving the approximated problem:

minimize_β f(β) subject to β ∈ C(δ±) ≜ SOL-ASC_{δ±}(A,b), (24)

where the feasible region is defined by the family of approximating functions {p±_j(•, δ±_j)}_{j=1}^n, each satisfying assumptions (R1), (R2), and (R3) as well as condition (12) or (13).

Let {δ±;k} be a sequence of positive scalars converging to zero. For each k, let β^k be a B-stationary solution of (24) corresponding to δ±;k; i.e.,

f′(β^k; v) ≥ 0, ∀ v ∈ T(β^k; C(δ±;k)).

Suppose that the sequence {β^k} converges to β̄, which must necessarily belong to C by the convergence of C(δ±;k) to C. The question is whether β̄ is a B-stationary solution of (23). For this question to have an affirmative answer, it suffices to identify, for any v ∈ T(β̄; C), an infinite index set κ ⊆ {1,2,···} and a sequence of tangents {v^k}_{k∈κ}, with v^k ∈ T(β^k; C(δ±;k)) for each k, such that {v^k}_{k∈κ} converges to v. Indeed, if such v^k exist, then using the fact that h′(β̄; v) ≥ limsup_{k(∈κ)→∞} h′(β^k; v^k), by the convexity of h, we deduce

f′(β̄; v) ≥ limsup_{k(∈κ)→∞} f′(β^k; v^k) ≥ 0,

from which the desired B-stationarity of β̄ follows readily. If f is differentiable, then it suffices to have ∇f(β^k)ᵀv^k → ∇f(β̄)ᵀv. Before constructing the desired sequence {v^k}, we provide an example to illustrate that this may not always be possible, thus establishing the failed convergence of such stationary solutions in general.

Example 5 Consider the 2-variable optimization problem:

minimize_{β1,β2} (1/2)[ (β1 − 1)² + (β2 − 1)² ] subject to |β1|_0 + |β2|_0 ≤ 1. (25)

We approximate the constraint by the capped ℓ1 surrogate function, obtaining the approximated problem: for δ_k > 0,

minimize_{β1,β2} (1/2)[ (β1 − 1)² + (β2 − 1)² ] subject to min( |β1|/δ_k, 1 ) + min( |β2|/δ_k, 1 ) ≤ 1. (26)

It is not difficult to verify that β(δ_k) ≜ ( δ_k/2, δ_k/2 ) is a B-stationary solution of the latter problem. Yet the limit (0,0) is not a B-stationary solution of the original problem (25), by noting that the tangent cone of the feasible region at this limit point is equal to the feasible region itself. For this example, we note that the constraint in (26) is binding at the approximated pair β(δ_k), but the constraint in (25) is not binding at the limit (0,0). Incidentally, this situation would not happen with a polyhedron under perturbation; it happens here partly due to the discontinuity of the ℓ0 function.


[Fig. 2: Illustration of Example 5 using iterates. The figure shows the contours of f, the feasible set C = { (β1,β2) | |β1|_0 + |β2|_0 ≤ 1 } and its approximation C(δ_k) = { (β1,β2) | min(|β1|/δ_k, 1) + min(|β2|/δ_k, 1) ≤ 1 }, the iterate β^k with its gradient ∇f(β^k), the tangent cones T(β̄;C) = { (v1,v2) | |v1|_0 + |v2|_0 ≤ 1 } and T(β^k;C(δ_k)) = { (v1,v2) | v1 + v2 ≤ 0 }, and the unique normal to T(β̄;C).]
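The following Python sketch (our own companion to Example 5 and Figure 2, not from the paper) verifies numerically that β(δ_k) is B-stationary for (26) while the limit (0,0) fails B-stationarity for (25); the tangent cones used are those displayed in the figure.

```python
# Numerical companion to Example 5.
import numpy as np

grad_f = lambda b: np.array([b[0] - 1.0, b[1] - 1.0])   # gradient of f in (25)

rng = np.random.default_rng(0)
for delta in (0.5, 0.1, 0.01):
    bk = np.array([delta / 2, delta / 2])
    g = grad_f(bk)
    # sample directions from the cone T(bk; C(delta)) = {v : v1 + v2 <= 0}
    V = rng.uniform(-1, 1, size=(1000, 2))
    V = V[V.sum(axis=1) <= 0]
    # g1 = g2 < 0, so g^T v = g1*(v1 + v2) >= 0 on the cone: bk is B-stationary.
    print(f"delta = {delta}: min g^T v over sampled cone = {(V @ g).min():.3e}")

# At the limit (0,0), T((0,0); C) equals the feasible set itself and contains
# v = (1, 0), along which f strictly decreases; hence (0,0) is not B-stationary.
print("f'((0,0); (1,0)) =", grad_f(np.array([0.0, 0.0])) @ np.array([1.0, 0.0]))
```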

For operational purposes, we assume in the rest of the paper that the representation (20) of the tangent cone T(β^k; C(δ±;k)) of the approximated set C(δ±;k) is valid. To facilitate the identification of the desired approximating tangents, we write the two cones T(β̄; C) and T(β^k; C(δ±;k)) as follows. First, let S ≜ supp(β̄) with complement S^c. We have

T(β̄; C) = cl[ { v | ∑_{j∈S^c} A_{ij} |v_j|_0 ≤ b_i − ∑_{j=1}^n A_{ij} |β̄_j|_0, ∀ i = 1,··· ,m } ],

while

T(β^k; C(δ±;k)) = { v | ∑_{j=1}^n A⁺_{ij} p⁺_j(•, δ^{+;k}_j)′(β^k_j; v_j) ≤ ∑_{j=1}^n A⁻_{ij} p⁻_j(•, δ^{−;k}_j)′(β^k_j; v_j), ∀ i ∈ A_{δ±;k}(β^k) }.

For any vector v ∈ ℝⁿ, let

V_{=0} ≜ { j ∈ S^c | j ∉ supp(v) } and V_{≠0} ≜ { j ∈ S^c | j ∈ supp(v) },

whose union is the complement S^c of the support of the vector β̄. We divide the sum ∑_{j=1}^n A±_{ij} p±_j(•, δ^{±;k}_j)′(β^k_j; v_j) according to a given vector v ∈ T(β̄; C) and the two associated index sets V_{=0} and V_{≠0}:

∑_{j=1}^n A±_{ij} p±_j(•, δ^{±;k}_j)′(β^k_j; v_j) = ∑_{j∈S} A±_{ij} p±_j(•, δ^{±;k}_j)′(β^k_j; v_j) + T±_{k;0}(v) + T±_{k;≠0}(v),

where

T±_{k;0}(v) ≜ ∑_{j∈V_{=0}} A±_{ij} p±_j(•, δ^{±;k}_j)′(β^k_j; v_j) and T±_{k;≠0}(v) ≜ ∑_{j∈V_{≠0}} A±_{ij} p±_j(•, δ^{±;k}_j)′(β^k_j; v_j).

Letting δ^{±;k}_j = δ_k for all j = 1,··· ,n, we can write

T±_{k;0}(v) = (1/δ_k) ∑_{j∈V_{=0}} A±_{ij} [ δ_k p±_j(•, δ_k)′(β^k_j; v_j) ]

and

T±_{k;≠0}(v) = (1/δ_k) ∑_{j∈V_{≠0}} A±_{ij} [ δ_k p±_j(•, δ_k)′(β^k_j; v_j) ].


Under assumption (R3) on the functions p±_j(•, δ), we deduce the following two one-sided derivatives for all k sufficiently large and all j ∈ S:

p±_j(•, δ_k)′(t; ±1) = 0, ∀ t such that |t| > δ_k.

For an index j ∈ S, since β^k_j → β̄_j ≠ 0 and δ_k → 0, it follows that for all but finitely many k, |β^k_j| > δ_k. Hence, under the stipulation that δ^{±;k}_j = δ_k for all j = 1,··· ,n, it follows that for all k sufficiently large, v ∈ T(β^k; C(δ±;k)) if and only if, for all i ∈ A_{δ±;k}(β^k),

∑_{j∈V_{=0}} A⁺_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v_j) ] + ∑_{j∈V_{≠0}} A⁺_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v_j) ]

≤ ∑_{j∈V_{=0}} A⁻_{ij} [ δ_k p⁻_j(•, δ_k)′(β^k_j; v_j) ] + ∑_{j∈V_{≠0}} A⁻_{ij} [ δ_k p⁻_j(•, δ_k)′(β^k_j; v_j) ]. (27)

This can be contrasted with the necessary and sufficient condition for v ∈ T(β̄; C), which is:

∑_{j∈V_{≠0}} A⁺_{ij} ≤ ∑_{j∈V_{≠0}} A⁻_{ij} + [ b_i − ∑_{j∈S} A_{ij} ], ∀ i = 1,··· ,m, (28)

provided that the set of vectors v satisfying the latter inequalities is closed. This is the case, for instance, when the matrix A is nonnegative, as we will assume in Subsection 6.1. One obvious difference between (27) and (28) is that the components v_j, for j ∈ V_{=0}, of the tangent v of the approximated set C(δ±;k) appear explicitly in the former, whereas the same components of v do not appear in the latter.

At this point, it is useful to provide the directional derivatives of the surrogate sparsity functions discussed in Subsection 4.1; in particular, the derived expressions will verify the following properties of ρ(•,δ)′(t; ±1) for all δ > 0 and all nonzero s ∈ ℝ, namely:

(R4a) sign[ ρ(•,δ)′(t;s) ] = sign(s) sign(t) if 0 < |t| < δ, and = 1 if t = 0;

(R4b) for |t| = δ: ρ(•,δ)′(t;s) = 0 if st > 0, and ρ(•,δ)′(t;s) ≤ 0 if st < 0; and

(R4c) ρ(•,δ)′(t;s) = 0 for |t| > δ.

The function ρ(•,δ) is not differentiable at the origin and possibly at ±δ (see, e.g., the truncated transformed ℓ1-function below). By definition, we must have ρ(•,δ)′(t;0) = 0 for all t. Moreover, for st ≠ 0, ρ(•,δ)′(t;s) > 0 if and only if |t| < δ and st > 0.

In what follows, we give the expressions of the directional derivatives of three such functions: SCAD, MCP, and the truncated transformed ℓ1; we omit the other two: capped ℓ1 and truncated logarithmic.

The SCAD family. We have

ρ^SCAD_a(•,δ)′(t; ±1) =
  ±sign(t) · 2a/((a+1)δ)                  if 0 < |t| ≤ δ/a,
  2a/((a+1)δ)                             if t = 0,
  ±sign(t) · 2(δ − |t|)/((1 − 1/a²)δ²)    if δ/a ≤ |t| ≤ δ,
  0                                       if |t| ≥ δ,

which is continuously differentiable at all nonzero t ∈ ℝ. Thus

δ ρ^SCAD_a(•,δ)′(t; s) =
  s·sign(t) · 2a/(a+1)                    if 0 < |t| ≤ δ/a,
  |s| · 2a/(a+1)                          if t = 0,
  s·sign(t) · 2(1 − |t|/δ)/(1 − 1/a²)     if δ/a ≤ |t| ≤ δ,
  0                                       if |t| ≥ δ.

The MCP family. We have

ρ^MCP(•,δ)′(t; ±1) =
  ±sign(t) · ( 2/δ − 2|t|/δ² )   if 0 < |t| ≤ δ,
  2/δ                            if t = 0,
  0                              if |t| ≥ δ,

which has the same differentiability properties as a SCAD function. Moreover,

δ ρ^MCP(•,δ)′(t; s) =
  s·sign(t) · ( 2 − 2|t|/δ )     if 0 < |t| ≤ δ,
  2|s|                           if t = 0,
  0                              if |t| ≥ δ.

The truncated transformed ℓ1 family. We have

ρ^TTL1_a(•,δ)′(t; ±1) =
  ±sign(t) · a(a+δ)/(δ(a+|t|)²)     if 0 < |t| < δ,
  (a+δ)/(δa)                        if t = 0,
  min( ±sign(t) · a/(δ(a+δ)), 0 )   if |t| = δ,
  0                                 if |t| > δ,

yielding

δ ρ^TTL1_a(•,δ)′(t; s) =
  s·sign(t) · a(a+δ)/(a+|t|)²       if 0 < |t| < δ,
  |s| (a+δ)/a                       if t = 0,
  min( s·sign(t) · a/(a+δ), 0 )     if |t| = δ,
  0                                 if |t| > δ.
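These closed-form directional derivatives are straightforward to implement and test. The following sketch (our own check, not from the paper) codes the MCP case and validates it against one-sided finite differences, which also exercises properties (R4a)-(R4c); the tolerance and test points are arbitrary.

```python
# Validate the MCP directional derivative against one-sided finite differences.
import math

def rho_mcp(t, delta):
    a = abs(t)
    return 2*a/delta - t*t/(delta*delta) if a <= delta else 1.0

def drho_mcp(t, s, delta):
    # directional derivative rho'(t; s) from the closed form in the text
    if t == 0.0:
        return 2.0 * abs(s) / delta
    if abs(t) <= delta:
        return s * math.copysign(1.0, t) * (2.0/delta - 2.0*abs(t)/delta**2)
    return 0.0

delta, tau = 0.4, 1e-7
for t in (0.0, 0.1, -0.25, 0.8):
    for s in (1.0, -1.0):
        fd = (rho_mcp(t + tau*s, delta) - rho_mcp(t, delta)) / tau
        assert abs(fd - drho_mcp(t, s, delta)) < 1e-5, (t, s)
print("finite differences match the closed-form directional derivative")
```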

6.1 The case A ≥ 0

Using the last inclusion of the tangent cone in Proposition 10, this nonnegative case is relatively easy to deal with.


Proposition 12 Let A ≥ 0 and f = h − g be a dc function with g and h both convex and g additionally continuously differentiable. Let δ^{±;k}_j = δ_k for all j = 1,··· ,n and all k, and let lim_{k→∞} δ_k = 0. For each k, let β^k be a B-stationary solution of the problem:

minimize_β f(β) subject to β ∈ C(δ_k) ≜ { β | ∑_{j=1}^n A_{ij} p⁺_j(β_j, δ_k) ≤ b_i, ∀ i = 1,··· ,m },

where each surrogate function p⁺_j(•, δ_k) satisfies conditions (R1)–(R4). If β̄ is the limit of {β^k} satisfying the property

(A^{≥0}_{β̄}): if i is such that ∑_{j=1}^n A_{ij} p⁺_j(β^k_j, δ_k) = b_i (29)

for infinitely many k, then i ∈ A_ASC(β̄),

then β̄ is a B-stationary solution of (23).

Proof Let v ∈ T(β̄; C) with C ≜ SOL-ASC(A,b). It suffices to construct a sequence {v^k} and identify an infinite index set κ such that the subsequence {v^k}_{k∈κ} converges to v and, for all k ∈ κ sufficiently large,

0 ≥ ∑_{j∈V_{=0}} A_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v^k_j) ] + ∑_{j∈V_{≠0}} A_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v^k_j) ], ∀ i such that (29) holds. (30)

Define the components v^k_j as follows:

v^k_j ≜ v_j if either j ∈ S or j ∈ V_{≠0}, and −β^k_j if j ∈ V_{=0}.

For every k, there is a (possibly empty) index set A_k of constraints i such that (29) holds corresponding to the pair (i,k). Since there are only finitely many constraints, there exists an infinite subset κ of {1,2,···} such that A_k is a constant set, say Ā, for all k ∈ κ. By assumption (A^{≥0}_{β̄}), we have Ā ⊆ A_ASC(β̄). It follows from Proposition 10 that v_j = 0 provided that there exists an i ∈ A_ASC(β̄) such that A_{ij} > 0. Thus, for every index j ∈ V_{≠0}, we must have A_{ij} = 0 for all i ∈ A_ASC(β̄). Hence, the requirement (30) for the sequence {v^k}_{k∈κ} reduces to

0 ≥ ∑_{j∈V_{=0}} A_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v^k_j) ], ∀ i ∈ Ā and all k ∈ κ.

By (R4), we have, for j ∈ V_{=0} with β^k_j ≠ 0, sign[ p⁺_j(•, δ_k)′(β^k_j; v^k_j) ] = sign(v^k_j) sign(β^k_j) = −1, by the choice of v^k_j. Hence it follows that v^k ∈ T(β^k; SOL-ASC_{δ_k}(A,b)) for all k ∈ κ. It remains to show that {v^k}_{k∈κ} converges to v. But this is clear from the definition of the components v^k_j and the fact that β^k_j → β̄_j = 0 = v_j for all j ∈ V_{=0}.


6.2 The case of column-wise uni-sign

In this subsection, we assume that the objective function f in (23) is continuously differentiable (C¹) so that we can focus on the choice of the sequence of approximate tangents. Let W(β̄) be the set of vectors v satisfying the inequalities in (28). Since the closure of W(β̄) is equal to T(β̄; SOL-ASC(A,b)), it follows that a necessary and sufficient condition for β̄ to be a B-stationary solution of (23) is that

0 ≤ ∇f(β̄)ᵀv = ∑_{j∈S} (∂f(β̄)/∂β_j) v_j + ∑_{j∈V_{≠0}} (∂f(β̄)/∂β_j) v_j, ∀ v ∈ W(β̄). (31)

In what follows, we obtain a necessary and sufficient condition for a given β̄ ∈ SOL-ASC(A,b) to be a B-stationary solution of (23), based on which we will address the convergence of the B-stationary solutions of the approximated problems. For this purpose, define

J(β̄) ≜ { J ⊆ S^c | ∑_{j∈J} A_{•j} ≤ b − A‖β̄‖_0 }.

This family J(β̄) collects all index sets V_{≠0} corresponding to the tangents v in T(β̄; SOL-ASC(A,b)).

Lemma 1 Let f be a C¹ function defined on an open set containing SOL-ASC(A,b). A vector β̄ ∈ SOL-ASC(A,b) is a B-stationary solution of (23) if and only if

∂f(β̄)/∂β_j = 0, ∀ j ∈ supp(β̄) ∪ ⋃_{J ∈ J(β̄)} J. (32)

Proof "If." Associated with every vector v ∈ W(β̄) is an index set J satisfying ∑_{j∈J} A_{•j} ≤ b − A‖β̄‖_0. By assumption, ∂f(β̄)/∂β_j = 0 for all j ∈ J. The equality in the expression for ∇f(β̄)ᵀv in (31) and the assumption (32) easily yield the B-stationarity of β̄.

"Only if." Suppose that the inequality in (31) holds. Any vector v with V_{≠0} = ∅ belongs to the set W(β̄). Thus, we must have

∑_{j∈S} (∂f(β̄)/∂β_j) v_j ≥ 0, ∀ v_S.

This implies that ∂f(β̄)/∂β_j = 0 for all j ∈ S. Hence the B-stationarity of β̄ reduces to

∑_{j∈V_{≠0}} (∂f(β̄)/∂β_j) v_j ≥ 0, ∀ (v_j)_{j∈V_{≠0}} satisfying ∑_{j∈V_{≠0}} A_{•j} ≤ b − A‖β̄‖_0.

Let j̄ ∈ J for some J ∈ J(β̄). We have ∑_{j′∈J} A_{•j′} ≤ b − A‖β̄‖_0. For any scalar ε > 0, define the vectors v^{ε;±} as follows:

v^{ε;±}_j ≜ 0 if j ∉ J; ε if j ∈ J and j ≠ j̄; and ±1 if j = j̄.

It is not difficult to see that { j ∉ S | v^{ε;±}_j ≠ 0 } = J. Hence we have

0 ≤ ε ∑_{j ∈ J\{j̄}} ∂f(β̄)/∂β_j ± ∂f(β̄)/∂β_{j̄}.

Since this holds for all ε > 0, it follows by passing to the limit ε ↓ 0 that ± ∂f(β̄)/∂β_{j̄} ≥ 0, yielding ∂f(β̄)/∂β_{j̄} = 0, establishing that (32) is necessary for B-stationarity.
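Lemma 1 reduces the B-stationarity test to checking partial derivatives over supp(β̄) and the index family J(β̄). The following Python sketch (our own toy instance, not from the paper) implements the test for a quadratic f under the constraint ‖β‖_0 ≤ 1 written as A = [1 1] and b = 1, enumerating J(β̄) by brute force; all instance data are hypothetical.

```python
# Brute-force implementation of the stationarity test (32) of Lemma 1.
import itertools
import numpy as np

A = np.array([[1.0, 1.0]])
b = np.array([1.0])
target = np.array([1.0, 0.0])
grad = lambda beta: beta - target          # gradient of f = 0.5*||beta - target||^2

def families_J(beta):
    # enumerate J(beta) = {J subset of S^c : sum_{j in J} A_col_j <= b - A ||beta||_0}
    S = set(np.flatnonzero(beta))
    Sc = [j for j in range(len(beta)) if j not in S]
    resid = b - A @ (beta != 0).astype(float)
    return [J for r in range(len(Sc) + 1) for J in itertools.combinations(Sc, r)
            if np.all(A[:, list(J)].sum(axis=1) <= resid + 1e-12)]

def is_B_stationary(beta):
    g = grad(beta)
    touched = set(np.flatnonzero(beta)).union(*map(set, families_J(beta)))
    return all(abs(g[j]) < 1e-12 for j in touched)

print(is_B_stationary(np.array([1.0, 0.0])))   # True: the constrained minimizer
print(is_B_stationary(np.array([0.5, 0.0])))   # False: nonzero partial on supp(beta)
```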

Based on the above lemma, we can establish the following result when the matrix A satisfies the column-wise uni-sign property. With this special structure on A, we may divide the columns of A into two groups: C_⊕, whose entries are all nonnegative, and C_⊖, whose entries are all nonpositive. The noteworthy part in the proof of the result is that not all components of the sequence {v^k} converge to the corresponding components of v (those with indices in the sets V^⊖_{≠0} and V^⊕_{≠0} do not; see the notation below), but we must have ∇f(β^k)ᵀv^k → ∇f(β̄)ᵀv restricted to an appropriate subsequence; this is sufficient to establish the desired B-stationarity of the limit β̄.

Theorem 2 Let A have the column-wise uni-sign property and f be a C¹ function defined on an open set containing SOL-ASC(A,b). Let δ^{±;k}_j = δ_k for all j = 1,··· ,n and all k, and let lim_{k→∞} δ_k = 0. For each k, let β^k be a B-stationary solution of the problem:

minimize_β f(β) subject to β ∈ C(δ_k) ≜ { β | ∑_{j=1}^n A⁺_{ij} p⁺_j(β_j, δ_k) ≤ b_i + ∑_{j=1}^n A⁻_{ij} p⁻_j(β_j, δ_k), ∀ i = 1,··· ,m },

where each pair of surrogate functions p±_j(•, δ_k) satisfies conditions (R1)–(R4) and either (12) or (13). If β̄ is the limit of {β^k} satisfying the property

(A_{β̄}): if i is such that ∑_{j=1}^n A⁺_{ij} p⁺_j(β^k_j, δ_k) = b_i + ∑_{j=1}^n A⁻_{ij} p⁻_j(β^k_j, δ_k) (33)

for infinitely many k, then i ∈ A_ASC(β̄),

then β̄ is a B-stationary solution of (23) if and only if

∂f(β̄)/∂β_j = 0, ∀ j ∈ ⋃_{J ∈ J(β̄)} J such that ∃ i ∈ A_ASC(β̄) with A_{ij} ≠ 0. (34)

Proof It suffices to prove the "if" statement. We proceed as in the proof of Proposition 12. For every k, there is a (possibly empty) index set A_k of constraints i such that (33) holds corresponding to the pair (i,k). Since there are only finitely many constraints, there exists an infinite subset κ of {1,2,···} such that A_k is a constant set, say Ā, for all k ∈ κ. By assumption (A_{β̄}), we have Ā ⊆ A_ASC(β̄). Define the components v^k_j as follows:

v^k_j ≜
  v_j      if j ∈ S,
  β^k_j    if j ∈ V_{=0} ∩ C_⊖ ≜ V^⊖_{=0},
  −β^k_j   if j ∈ V_{=0} ∩ C_⊕ ≜ V^⊕_{=0},
  v_j      if j ∈ V_{≠0} and A_{ij} = 0 ∀ i ∈ A_ASC(β̄); denoted j ∈ V^0_{≠0},
  0        if j ∈ V_{≠0} ∩ C_⊖ and ∃ i ∈ A_ASC(β̄) such that A_{ij} ≠ 0; denoted j ∈ V^⊖_{≠0},
  −β^k_j   if j ∈ V_{≠0} ∩ C_⊕ and ∃ i ∈ A_ASC(β̄) such that A_{ij} ≠ 0; denoted j ∈ V^⊕_{≠0}.

Note that the components v^k_j for j ∈ V^⊖_{≠0} ∪ V^⊕_{≠0} do not necessarily converge to v_j, whereas all other components do. Taking into account the column-wise uni-sign property of A, the inequality in (27) for i ∈ Ā, which is a subset of A_ASC(β̄), can be written equivalently as

∑_{j∈V^⊕_{=0}} A⁺_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v^k_j) ] + ∑_{j∈V^⊕_{≠0}} A⁺_{ij} [ δ_k p⁺_j(•, δ_k)′(β^k_j; v^k_j) ]

≤ ∑_{j∈V^⊖_{=0}} A⁻_{ij} [ δ_k p⁻_j(•, δ_k)′(β^k_j; v^k_j) ] + ∑_{j∈V^⊖_{≠0}} A⁻_{ij} [ δ_k p⁻_j(•, δ_k)′(β^k_j; v^k_j) ],

where the last sum equals 0 since v^k_j = 0 for all such j. By the definition of v^k_j and property (R4) of the directional derivatives p±_j(•, δ_k)′(β_j; •) of the surrogate functions, it follows that for all i ∈ Ā, the left-hand sum is nonpositive while the right-hand sum is nonnegative. Thus the above-defined vector v^k ∈ T(β^k; SOL-ASC_{δ_k}(A,b)) for all k ∈ κ sufficiently large. Hence we have, for all such k,

0 ≤ ∇f(β^k)ᵀv^k = ∑_{j∈S} (∂f(β^k)/∂β_j) v^k_j + ∑_{j∈V_{=0}} (∂f(β^k)/∂β_j) v^k_j + ∑_{j∈V_{≠0}} (∂f(β^k)/∂β_j) v^k_j

= ∑_{j∈S} (∂f(β^k)/∂β_j) v_j   [which converges to ∑_{j∈S} (∂f(β̄)/∂β_j) v_j]

  + ∑_{j∈V^⊖_{=0}} (∂f(β^k)/∂β_j) β^k_j − ∑_{j∈V^⊕_{=0}} (∂f(β^k)/∂β_j) β^k_j   [which converges to 0]

  + ∑_{j∈V_{≠0}} (∂f(β^k)/∂β_j) v^k_j.

We can write

∑_{j∈V_{≠0}} (∂f(β^k)/∂β_j) v^k_j = ∑_{j∈V^⊖_{≠0}} (∂f(β^k)/∂β_j) v^k_j + ∑_{j∈V^⊕_{≠0}} (∂f(β^k)/∂β_j) v^k_j + ∑_{j∈V^0_{≠0}} (∂f(β^k)/∂β_j) v^k_j

= − ∑_{j∈V^⊕_{≠0}} (∂f(β^k)/∂β_j) β^k_j + ∑_{j∈V^0_{≠0}} (∂f(β^k)/∂β_j) v_j,

which converges to ∑_{j∈V^0_{≠0}} (∂f(β̄)/∂β_j) v_j, by (34). Hence {∇f(β^k)ᵀv^k}_{k∈κ} → ∇f(β̄)ᵀv, as desired.

The condition (34) is void when A is a nonnegative matrix. At this time, we are not able to address the case where there are nonzero entries in some columns of A that have mixed signs.

Remark 3 The assumptions (A^{≥0}_{β̄}) and (A_{β̄}) in Proposition 12 and Theorem 2, respectively, are not expected to be easily verifiable in practice. Nevertheless, they provide an explanation of the failure of convergence in Example 5; more importantly, these results show that the persistent holding of binding constraints in the limit is essential for the convergence of the B-stationary solutions of the approximated problems to a desired B-stationary solution of the ASC-constrained problem (23).

References

1. M. Ahn, J.S. Pang, and J. Xin. Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM Journal on Optimization, revision under review (as of February 2017).
2. A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and its Applications 30(1): 55–66 (2008).
3. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization. Statistical Science 27(4): 450–468 (2012).
4. P. Belotti, C. Kirches, S. Leyffer, J. Linderoth, J. Luedtke, and A. Mahajan. Mixed-integer nonlinear optimization. Acta Numerica 22: 1–131 (2013).
5. D. Bertsimas, A. King, and R. Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics 44(2): 813–852 (2016).
6. D. Bertsimas and R. Shioda. Algorithm for cardinality-constrained quadratic optimization. Computational Optimization and Applications 43(1): 1–22 (2009).
7. J. Bien, J. Taylor, and R. Tibshirani. A lasso for hierarchical interactions. Annals of Statistics 43(3): 1111–1141 (2013).
8. D. Bienstock. Computational study of a family of mixed-integer quadratic programming problems. Mathematical Programming, Series A 74(2): 121–140 (1996).
9. J. Brodie, I. Daubechies, C. De Mol, D. Giannone, and I. Loris. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences 106(30): 12267–12272 (2009).
10. O.P. Burdakov, C. Kanzow, and A. Schwartz. Mathematical programs with cardinality constraints: Reformulation by complementarity-type conditions and a regularization method. SIAM Journal on Optimization 26(1): 397–425 (2016).
11. C. Chen, X. Li, C. Tolman, S. Wang, and Y. Ye. Sparse portfolio selection via quasi-norm regularization. https://arxiv.org/abs/1312.6350 (2013).
12. M. Conforti and G. Cornuejols. A class of logic problems solvable by linear programming. Journal of the ACM 42(5): 1107–1112 (1995).
13. M. Conforti and G. Cornuejols. Balanced matrices. In: K. Aardal, G.L. Nemhauser, and R. Weismantel (Eds.), Discrete Optimization. Handbooks in Operations Research and Management Science, Volume 12 (Elsevier, Amsterdam 2005) pp. 277–320.
14. J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456): 1348–1360 (2001).
15. M. Feng, J.E. Mitchell, J.S. Pang, A. Waechter, and X. Shen. Complementarity formulations of ℓ0-norm optimization problems. Pacific Journal of Optimization, accepted August 2016.
16. J.H. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9(3): 432–441 (2013).
17. M. Hamada and C.F.J. Wu. Analysis of designed experiments with complex aliasing. Journal of Quality Technology 24: 130–137 (1992).
18. T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Taylor & Francis Group (Boca Raton 2015).
19. J. Huang, P. Breheny, and S. Ma. A selective review of group selection in high-dimensional models. Statistical Science 27(4): 481–499 (2012).
20. L. Jacob, G. Obozinski, and J.P. Vert. Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada (ICML '09, ACM New York) pp. 433–440 (2009).
21. C. Kanzow and A. Schwartz. A new regularization method for mathematical programs with complementarity constraints with strong convergence properties. SIAM Journal on Optimization 23(2): 770–798 (2013).
22. H.A. Le Thi, D.T. Pham, and X.T. Vo. DC approximation approaches for sparse optimization. European Journal of Operational Research 244: 26–46 (2015).
23. P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall (London 1983).
24. A.-V. de Miguel, M. Friedlander, F.J. Nogales, and S. Scholtes. A two-sided relaxation scheme for mathematical programs with equilibrium constraints. SIAM Journal on Optimization 16(2): 587–609 (2006).
25. G. Nemhauser and L. Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons (1999).
26. J.S. Pang, M. Razaviyayn, and A. Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research, published online (October 2016). http://dx.doi.org/10.1287/moor.2016.0795
27. H. Park, A. Niida, S. Miyano, and S. Imoto. Sparse overlapping group Lasso for integrative multi-omics analysis. Journal of Computational Biology 22(2): 73–84 (2015).
28. D. Ralph and S.J. Wright. Some properties of regularization and penalization schemes for MPECs. Optimization Methods and Software 19(5): 527–556 (2004).
29. R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Springer (1998).
30. S. Scholtes. Convergence properties of a regularisation scheme for mathematical programs with complementarity constraints. SIAM Journal on Optimization 11(4): 918–936 (2001).
31. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58(1): 267–288 (1996).
32. R. Tibshirani, M.A. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 67(1): 91–108 (2005).
33. J. Wang and J. Ye. Multi-layer feature reduction for tree structured group lasso via hierarchical projection. Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada (December 2015) pp. 1279–1287.
34. M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 68(1): 49–67 (2006).
35. C.H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38(2): 894–942 (2010).
36. T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research 11: 1081–1107 (2010).
37. P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics 37(6A): 3468–3497 (2009).
38. X. Zheng, X. Sun, D. Li, and J. Sun. Successive convex approximations to cardinality-constrained convex programs: a piecewise-linear DC approach. Computational Optimization and Applications 59(1–2): 379–397 (2014).
