IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 12, DECEMBER 2011 5907

Sparse Volterra and Polynomial Regression Models: Recoverability and Estimation

Vassilis Kekatos, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE

Abstract—Volterra and polynomial regression models play a major role in nonlinear system identification and inference tasks. Exciting applications ranging from neuroscience to genome-wide association analysis build on these models with the additional requirement of parsimony. This requirement has high interpretative value, but unfortunately cannot be met by least-squares-based or kernel regression methods. To this end, compressed sampling (CS) approaches, already successful in linear regression settings, can offer a viable alternative. The viability of CS for sparse Volterra and polynomial models is the core theme of this work. A common sparse regression task is initially posed for the two models. Building on (weighted) Lasso-based schemes, an adaptive RLS-type algorithm is developed for sparse polynomial regressions. The identifiability of polynomial models is critically challenged by dimensionality. However, following the CS principle, when these models are sparse, they can be recovered from far fewer measurements. To quantify the sufficient number of measurements for a given level of sparsity, restricted isometry properties (RIP) are investigated in commonly met polynomial regression settings, generalizing known results for their linear counterparts. The merits of the novel (weighted) adaptive CS algorithms for sparse polynomial modeling are verified through synthetic as well as real data tests for genotype-phenotype analysis.

Index Terms—Compressive sampling, Lasso, polynomial kernels, restricted isometry properties, Volterra filters.

I. INTRODUCTION

NONLINEAR systems with memory appear frequently in science and engineering. Pertinent application areas include physiological and biological processes [3], power amplifiers [2], loudspeakers [31], speech, and image models, to name a few; see, e.g., [16]. If the nonlinearity is sufficiently smooth, the Volterra series offers a well-appreciated model of the output expressed as a polynomial expansion of the input using Taylor's theorem [20]. The expansion coefficients of order $p$ are $p$-dimensional sequences of memory $L$, generalizing the one-dimensional impulse response sequence encountered with linear systems. However, polynomial expansions of nonlinear mappings go beyond filtering.

Manuscript received March 03, 2011; revised June 28, 2011; accepted August 08, 2011. Date of publication August 30, 2011; date of current version November 16, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Trac D. Tran. This work was supported by the Marie Curie International Outgoing Fellowship No. 234914 within the 7th European Community Framework Programme; by NSF Grants CCF-0830480, 1016605, and ECCS-0824007, 1002180; and by the QNRF-NPRP award 09-341-2-128. Part of the results of this work was presented at CAMSAP, Aruba, Dutch Antilles, December 2009 [15].

The authors are with the Department of Electrical and Computer Engineering, and the Digital Technology Center, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2011.2165952

Polynomial regression aims at approximating a multivariate nonlinear function via a polynomial expansion [13]. Apart from its extensive use for optical character recognition and other classification tasks [23], (generalized) polynomial regression has recently emerged as a valuable tool for revealing genotype-phenotype relationships in genome-wide association (GWA) studies [9], [18], [27], [28].

Volterra and polynomial regression models are jointly investigated here. Albeit nonlinear, their input-output (I/O) relationship is linear with respect to the unknown parameters, which can thus be estimated via linear least-squares (LS) [13], [16]. The major bottleneck is the "curse of dimensionality," since the number of regression coefficients grows as $L^P$ for memory $L$ and expansion order $P$. This not only raises computational and numerical stability challenges, but also dictates impractically long data records for reliable estimation. One approach to coping with this dimensionality issue is to view polynomial modeling as a kernel regression problem [11], [13], [23].

However, various applications admit sparse polynomial expansions, where only a few, say $s$ out of $D$, expansion coefficients are nonzero—a fact that cannot be exploited via polynomial kernel regression. The nonlinearity order, the memory size, and the nonzero coefficients may all be unknown. Nonetheless, the polynomial expansion in such applications is sparse—an attribute that can be due either to a parsimonious underlying physical system, or to an overparameterized model. Sparsity in polynomial expansions constitutes the motivation behind this work. Volterra system identification and polynomial regression are formulated in Section II. After explaining the link between the two problems, several motivating applications with inherent sparse structure are provided.

Section III deals with the estimation of sparse polynomial expansions. Traditional polynomial filtering approaches either drop the contribution of expansion terms a priori, or adopt the sparsity-agnostic LS estimator [16]. Alternative estimators rely on: estimating a frequency-domain equivalent model; modeling the nonlinear filter as the convolution of two or more linear filters; transforming the polynomial representation to a more parsimonious one (e.g., using the Laguerre expansion); or estimating fewer coefficients and then linearly interpolating the full model; see [16] and references therein. However, recent advances in compressive sampling [6], [8], and the least-absolute shrinkage and selection operator (Lasso) [25] offer a precious toolbox for estimating sparse signals. Sparse Volterra channel estimators are proposed in [15] and [17]. Building on well-established (weighted) Lasso estimators [25], [32], and their efficient coordinate descent implementation [12], the present paper develops an adaptive RLS-type sparse polynomial estimation algorithm, which generalizes [1] to the nonlinear case, and constitutes the first contribution.

Performance of the (weighted) Lasso estimators has been analyzed asymptotically in the number of measurements $N$ [10], [32]. With finite samples, identifiability of Lasso-based estimators and other compressive sampling reconstruction methods can be assessed via the so-called restricted isometry properties (RIP) of the involved regression matrix [6], [4]. It has been shown that certain random matrix ensembles satisfy desirable properties with high probability when $N$ scales at least as $s\log(D/s)$ for $s$-sparse vectors of dimension $D$ [6]. For Gaussian, Bernoulli, and uniform Toeplitz matrices appearing in sparse linear filtering, the lower bound on $N$ has been shown to scale as $s^2\log D$ [14], [22]. Section IV-A deals with RIP analysis for Volterra filters, which is the second contribution of this work. It is shown that for a uniformly distributed input, the second-order Volterra filtering matrix satisfies the RIP with high probability when $N$ scales as $s^2\log D$, which extends the bound from the linear to the Volterra filtering case.

The third contribution is the RIP analysis for the sparse polynomial regression setup (Section IV-B). Because there are no dependencies across rows of the involved regression matrix, different tools are utilized, and the resultant RIP bounds are stronger than their Volterra filter counterparts. It is proved that for a uniform input, recovering an $s$-sparse linear-quadratic regression requires a number of measurements that scales linearly with $s$, up to logarithmic factors. The same result also holds for a model oftentimes employed for GWA analysis.

Applicability of the existing batch sparse estimators and their developed adaptive counterparts is demonstrated through numerical tests in Section V. Simulations on synthetic and real GWA data show that sparsity-aware polynomial estimators can cope with the curse of dimensionality and yield parsimonious yet accurate models with relatively short data records. The work is concluded in Section VI.

Notation: Lower- (upper-) case boldface letters are reserved for column vectors (matrices), and calligraphic letters for sets; $\mathbf{1}_N$ denotes the all-ones vector of length $N$; $(\cdot)^T$ denotes transposition; $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ stands for the multivariate Gaussian probability density with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$; $\mathbb{E}[\cdot]$ denotes the expectation operator; $\|\mathbf{x}\|_p := (\sum_i |x_i|^p)^{1/p}$ for $p \ge 1$ stands for the $\ell_p$-norm of $\mathbf{x}$, and $\|\mathbf{x}\|_0$ for the $\ell_0$-pseudonorm that equals the number of nonzero entries of $\mathbf{x}$.

II. PROBLEM FORMULATION: CONTEXT AND MOTIVATION

Nonlinear system modeling using the Volterra expansion, as well as the more general notion of (multivariate) polynomial regression, are reviewed in this section. For both problems, the nonlinear I/O dependency is expressed in the standard (linear with respect to the unknown coefficients) matrix-vector form. After recognizing the "curse of dimensionality" inherent to the involved estimation problems, motivating applications admitting (approximately) sparse polynomial representations are highlighted.

A. Volterra Filter Model

Consider a nonlinear, discrete-time, and time-invariant I/O relationship $y_n = f(x_n, x_{n-1}, \ldots)$, where $x_n$ and $y_n$ denote the input and output samples at time $n$. While such nonlinear mappings can have infinite memory, finite-memory truncation is adopted in practice to yield $y_n = f(\mathbf{x}_n)$, where $\mathbf{x}_n := [x_n\ x_{n-1}\ \cdots\ x_{n-L+1}]^T$ with $L$ finite. Under smoothness conditions, this I/O relationship can be approximated by a Volterra expansion, oftentimes truncated to a finite order $P$, as

$$y_n = h_0 + \sum_{p=1}^{P} y_n^{(p)} + v_n \qquad (1)$$

where $v_n$ captures unmodeled dynamics and observation noise, assumed to be zero-mean and independent of $\mathbf{x}_n$ as well as across time; and $y_n^{(p)}$ denotes the output of the $p$th-order Volterra module given by

$$y_n^{(p)} = \sum_{l_1=0}^{L-1}\cdots\sum_{l_p=0}^{L-1} h_p(l_1, \ldots, l_p)\prod_{i=1}^{p} x_{n-l_i} \qquad (2)$$

where the memory $L$ has been considered identical for all modules without loss of generality. The Volterra expansion in (1), (2) has been thoroughly studied in its representation power and convergence properties; see, e.g., [16], [20].

The goal here is to estimate $h_p(l_1, \ldots, l_p)$ for $l_1, \ldots, l_p \in \{0, \ldots, L-1\}$ and $p = 0, 1, \ldots, P$, given the I/O samples $\{(x_n, y_n)\}_{n=1}^{N}$ and upper bounds on the expansion order $P$ and the memory size $L$. Although this problem has been extensively investigated [16], the sparsity present in the Volterra representation of many nonlinear systems will be exploited here to develop efficient estimators.

To this end, (1) will be expressed first in a standard matrix-vector form [16]. Define the vectors $\mathbf{x}_n^{(1)} := \mathbf{x}_n$ and $\mathbf{x}_n^{(p)} := \mathbf{x}_n^{(p-1)} \otimes \mathbf{x}_n$ for $p = 2, \ldots, P$, where $\otimes$ denotes the Kronecker product; and write the $p$th-order Volterra output as $y_n^{(p)} = (\mathbf{x}_n^{(p)})^T\mathbf{h}_p$, where $\mathbf{h}_p$ contains the coefficients of $h_p(l_1, \ldots, l_p)$ arranged accordingly. Using the latter, (1) can be rewritten as

$$y_n = \boldsymbol{\phi}_n^T\mathbf{h} + v_n \qquad (3)$$

where $\boldsymbol{\phi}_n := [1\ (\mathbf{x}_n^{(1)})^T\ \cdots\ (\mathbf{x}_n^{(P)})^T]^T$ and $\mathbf{h} := [h_0\ \mathbf{h}_1^T\ \cdots\ \mathbf{h}_P^T]^T$. Stacking (1) for all $n = 1, \ldots, N$, one arrives at the linear model

$$\mathbf{y} = \mathbf{\Phi}\mathbf{h} + \mathbf{v} \qquad (4)$$

where $\mathbf{y} := [y_1\ \cdots\ y_N]^T$, $\mathbf{\Phi} := [\boldsymbol{\phi}_1\ \cdots\ \boldsymbol{\phi}_N]^T$, and $\mathbf{v} := [v_1\ \cdots\ v_N]^T$.
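To make the stacking in (3), (4) concrete, the following Python sketch builds the redundant regressor $\boldsymbol{\phi}_n$ via Kronecker products and stacks the rows into $\mathbf{\Phi}$. This is a minimal illustration under the paper's definitions; the function names and toy sizes are ours, not from the paper.

```python
import numpy as np

def volterra_regressor(x_window, P):
    """Stack [1, x^(1), ..., x^(P)] for one time instant, where
    x^(p) = x^(p-1) kron x_window (the redundant form used in (3))."""
    parts = [np.ones(1)]
    xp = x_window
    for p in range(1, P + 1):
        parts.append(xp)
        xp = np.kron(xp, x_window)
    return np.concatenate(parts)

def volterra_matrix(x, L, P):
    """Rows phi_n^T for n = L-1, ..., N-1 (full input windows only), cf. (4)."""
    rows = [volterra_regressor(x[n - L + 1:n + 1][::-1], P)
            for n in range(L - 1, len(x))]
    return np.vstack(rows)

# Toy usage: 200 uniform input samples, memory L=3, order P=2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
Phi = volterra_matrix(x, L=3, P=2)
print(Phi.shape)   # (198, 13): 1 + L + L^2 columns in the redundant form
```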

B. Polynomial Regression Model

Generalizing the Volterra filter expansion, polynomial regression aims at approximating a nonlinear function $y = f(x_1, \ldots, x_L)$ of $L$ variables through an expansion similar to (1) and (2), where the input vector is now defined as $\mathbf{x}_n := [x_{n,1}\ \cdots\ x_{n,L}]^T$, and $n$ is not necessarily a time index. Again the goal is to estimate $\mathbf{h}$ given $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$. Polynomial regression can be interpreted as the $P$th-order Taylor series expansion of $f(\mathbf{x}_n)$, and appears in several multilinear estimation and prediction problems in engineering, natural sciences, and economics [13].

By simply choosing $x_{n,i} = x_{n-i+1}$ for $i = 1, \ldots, L$, the Volterra filter becomes a special case of polynomial regression. Since this extra property has not been exploited in deriving (1)-(4), these equations carry over to the polynomial regression setup. For this reason, the same notation will be used henceforth for the two setups; the ambiguity will be easily resolved by the context.

C. The Curse of Dimensionality

Estimating the unknown coefficients in both Volterra system identification and polynomial regression is critically challenged by the curse of dimensionality. The Kronecker products defining $\mathbf{x}_n^{(p)}$ imply that its dimension is $L^p$, and consequently $\boldsymbol{\phi}_n$ and $\mathbf{h}$ have dimension $\sum_{p=0}^{P} L^p$. Note that all possible permutations of the indexes $(l_1, \ldots, l_p)$ multiply the same input term; e.g., $h_2(0, 1)$ and $h_2(1, 0)$ both multiply the monomial $x_n x_{n-1}$. To obtain a unique representation of (2), only one of these permutations is retained. After discarding the redundant coefficients, the dimension of $\mathbf{x}_n^{(p)}$ and $\mathbf{h}_p$ is reduced to $\binom{L+p-1}{p}$ [16]. Exploiting such redundancies in modules of all orders eventually shortens $\boldsymbol{\phi}_n$ and $\mathbf{h}$ to dimension

$$D := \sum_{p=0}^{P} \binom{L+p-1}{p} = \binom{L+P}{P} \qquad (5)$$

which still grows fast with increasing $L$ and $P$. For notational brevity, $\boldsymbol{\phi}_n$ and $\mathbf{h}$ will henceforth denote the shortened versions of the variables in (4); that is, matrix $\mathbf{\Phi}$ will be $N \times D$.
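A quick numerical illustration of how (5) grows (a hypothetical two-line script using the binomial form of $D$):

```python
from math import comb

# D = choose(L+P, P): the nonredundant Volterra/polynomial dimension (5).
for L in (10, 20, 40):
    for P in (2, 3, 5):
        print(f"L={L:2d}  P={P}  D={comb(L + P, P)}")
```

Already for $L = 40$ and $P = 5$ the model has over one million coefficients, which motivates the sparsity-aware estimators developed next.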

D. Motivating Applications

Applications are outlined here involving models that admit (approximately) sparse polynomial representations. When $P$ and $L$ are unknown, model order selection can be accomplished via sparsity-cognizant estimators. Beyond this rather mundane task, sparsity can arise due to problem specifications, or be imposed for interpretability purposes.

A special yet widely employed Volterra model is the so-called linear-nonlinear-linear (LNL) one [16]. It consists of a linear filter with impulse response $\{a_l\}_{l=0}^{L_a-1}$, in cascade with a memoryless nonlinearity $g(\cdot)$, and a second linear filter $\{b_l\}_{l=0}^{L_b-1}$. The overall memory is thus $L = L_a + L_b - 1$. If $g(\cdot)$ is analytic on an open set, it accepts a Taylor series expansion $g(u) = \sum_{p} g_p u^p$ on that set. It can be shown that the $p$th-order redundant Volterra module is given by [16, Ch. 2]

$$h_p(l_1, \ldots, l_p) = g_p \sum_{k=0}^{L_b-1} b_k \prod_{i=1}^{p} a_{l_i - k} \qquad (6)$$

for $l_i = 0, \ldots, L-1$. In (6), there are $p$-tuples $(l_1, \ldots, l_p)$ for which there is no $k$ such that $0 \le l_i - k \le L_a - 1$ for all $i$. For these $p$-tuples, the corresponding Volterra coefficient is zero. As an example, for filters of length $L_a = L_b = 6$ and for $P = 3$, among the 364 nonredundant Volterra coefficients, the nonzero ones are no more than 224. When $L_a$ and $L_b$ are not known, the locations of the zero coefficients cannot be determined a priori. By dropping the second linear filter in the LNL model, the Wiener model is obtained. Its Volterra modules follow immediately from (6) and have the separable form $h_p(l_1, \ldots, l_p) = g_p \prod_{i=1}^{p} a_{l_i}$ for every $p$ [16]. Likewise, by ignoring the first filter, the LNL model is transformed to the so-called Hammerstein model, in which $h_p(l_1, \ldots, l_p) = g_p b_{l_1}$ for $l_1 = \cdots = l_p$, and 0 otherwise. The key observation in all three models is that if at least one of the linear filters is sparse, the resulting Volterra filter is even sparser, as the short numerical check below illustrates.
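The sparsity propagation just claimed is easy to verify for the Wiener case, where $h_2(l_1, l_2) = g_2 a_{l_1} a_{l_2}$; the filter taps and $g_2$ below are made up purely for illustration:

```python
import numpy as np

# Second-order Wiener kernel: h2(l1, l2) = g2 * a[l1] * a[l2].
a = np.array([0.9, 0.0, 0.0, 0.4, 0.0, 0.0])   # 2 nonzero taps out of L = 6
g2 = 0.5
h2 = g2 * np.outer(a, a)
print(np.count_nonzero(a), "of", a.size)        # 2 of 6  (linear filter)
print(np.count_nonzero(h2), "of", h2.size)      # 4 of 36 (quadratic kernel)
```

The nonzero fraction of the quadratic kernel is the square of that of the linear filter, i.e., the Volterra representation is sparser than the filter generating it.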

That is usually the case when modeling the nonlinear behavior of loudspeakers and high-power amplifiers (HPA) [2], [16]. When a small-size (low-cost) loudspeaker is located close to a microphone (as is the case in cellular phones, teleconferencing, hands-free, or hearing aid systems), the loudspeaker sound is echoed by the environment before arriving at the microphone. A nonlinear acoustic echo canceller should adaptively identify the impulse response comprising the loudspeaker and the room, and thereby subtract undesirable echoes from the microphone signal. The cascade of the loudspeaker, typically characterized by a short-memory LNL or a Wiener model, and the typically long but (approximately) sparse room impulse response gives rise to a sparse Volterra filter [31]. Similarly, HPAs residing at the transmitters of wireless communication links are usually modeled as LNL structures having only a few coefficients contributing substantially to the output [2, p. 60]. When the HPA is followed by a multipath wireless channel represented by a sparse impulse response, the overall system becomes sparse too [17].

Sparse polynomial expansions are also encountered in neuroscience and bioinformatics. Volterra filters have been adopted to model causal relationships in neuronal ensembles using spike-train data recorded from individual neurons [3], [24]. Casting the problem as a probit Volterra regression, conventional model selection techniques have been pursued to zero blocks of Volterra expansion coefficients, and thus reveal neuron connections. Furthermore, GWA analysis depends critically on sparse polynomial regression models [9], [27], [28]. Through GWA studies, geneticists identify which genes determine certain phenotypes, e.g., human genetic diseases or traits in other species. Analysis has revealed that genetic factors involve multiplicative interactions among genes—a fact known as epistasis; hence, linear gene-phenotype models are inadequate. The occurrence of a disease can be posed as a (logistic) multilinear regression, where apart from single-gene terms, the output depends on products of two or more genes as well [9]. To cope with the underdeterminacy of the problem and detect gene-gene interactions, sparsity-promoting logistic regression methods have been developed; see, e.g., [27].

Based on these considerations, exploiting sparsity in polynomial representations is well motivated, and prompted us to develop the sparsity-aware estimators described in the following section.


III. ESTIMATION OF SPARSE POLYNOMIAL EXPANSIONS

One of the attractive properties of Volterra and polynomial regression models is that the output is a linear function of the wanted coefficients. This allows one to develop standard estimators for $\mathbf{h}$ in (4). However, the number of coefficients $D$ can be prohibitively large for reasonable values of $P$ and $L$, even after removing redundancies. Hence, accurately estimating $\mathbf{h}$ requires a large number of measurements, which: i) may be impractical and/or violate the stationarity assumption in an adaptive system identification setup; ii) entails considerable computational burden; and iii) raises numerical instability issues. To combat this curse of dimensionality, batch sparsity-aware methods will be proposed first for polynomial modeling, and, based on them, adaptive algorithms will be developed afterwards.

A. Batch Estimators

Ignoring $\mathbf{v}$ in (4), the vector $\mathbf{h}$ can be recovered by solving the linear system of equations $\mathbf{y} = \mathbf{\Phi}\mathbf{h}$. Generally, a unique solution is readily found if $N \ge D$; but when $N < D$, there are infinitely many solutions. Capitalizing on the sparsity of $\mathbf{h}$, one should ideally solve

$$\min_{\mathbf{h}}\ \|\mathbf{h}\|_0 \quad \text{s.t.}\quad \mathbf{y} = \mathbf{\Phi}\mathbf{h} \qquad (7)$$

Recognizing the NP-hardness of solving (7), compressive sampling suggests solving instead the linear program [6], [8]

$$\min_{\mathbf{h}}\ \|\mathbf{h}\|_1 \quad \text{s.t.}\quad \mathbf{y} = \mathbf{\Phi}\mathbf{h} \qquad (8)$$

which is also known as basis pursuit, and can quantifiably approximate the solution of (7); see Section IV for more on the relation between (7) and (8). However, modeling errors and measurement noise motivate the LS estimator $\hat{\mathbf{h}}_{LS} := \arg\min_{\mathbf{h}} \|\mathbf{y} - \mathbf{\Phi}\mathbf{h}\|_2^2$. If $N \ge D$ and $\mathbf{\Phi}$ has full column rank, the LS solution is uniquely found as $\hat{\mathbf{h}}_{LS} = (\mathbf{\Phi}^T\mathbf{\Phi})^{-1}\mathbf{\Phi}^T\mathbf{y}$. If the input is drawn either from a continuous distribution or from a finite alphabet of at least $P + 1$ values, $\mathbf{\Phi}^T\mathbf{\Phi}$ is invertible almost surely; but its condition number grows with $L$ and $P$ [19]. A large condition number translates to numerically ill-posed inversion of $\mathbf{\Phi}^T\mathbf{\Phi}$, and amplifies noise too. If $N < D$, the LS solution is not unique; but one can choose the minimum $\ell_2$-norm solution $\hat{\mathbf{h}} = \mathbf{\Phi}^T(\mathbf{\Phi}\mathbf{\Phi}^T)^{-1}\mathbf{y}$.

For both over/underdetermined cases, one may resort to the ridge ($\ell_2$-norm regularized) solution

$$\hat{\mathbf{h}}_R := (\mathbf{\Phi}^T\mathbf{\Phi} + \delta\mathbf{I}_D)^{-1}\mathbf{\Phi}^T\mathbf{y} \qquad (9a)$$

$$\phantom{\hat{\mathbf{h}}_R :} = \mathbf{\Phi}^T(\mathbf{\Phi}\mathbf{\Phi}^T + \delta\mathbf{I}_N)^{-1}\mathbf{y} \qquad (9b)$$

for some $\delta > 0$, where the equality can be readily proved by algebraic manipulations. Calculating, storing in main memory, and inverting the matrices in parentheses are the main bottlenecks in computing $\hat{\mathbf{h}}_R$ via (9). Choosing (9a) versus (9b) depends on how $N$ and $D$ compare. Especially for polynomial (or Volterra) regression, the $(i,j)$th entry of $\mathbf{\Phi}\mathbf{\Phi}^T$, which is the inner product $\boldsymbol{\phi}_i^T\boldsymbol{\phi}_j$, can also be expressed as $\sum_{p=0}^{P}(\mathbf{x}_i^T\mathbf{x}_j)^p$ for the redundant representation. This computational alternative is an instantiation of the so-called kernel trick, and reduces the cost of computing each entry of $\mathbf{\Phi}\mathbf{\Phi}^T$ in (9b) from $O(D)$ to $O(L + P)$ [23], [11]; see also Section III-C.

In any case, neither $\hat{\mathbf{h}}_{LS}$ nor $\hat{\mathbf{h}}_R$ is sparse. To effect sparsity, the idea is to adopt as regularization penalty the weighted $\ell_1$-norm of the wanted vector [25]

$$\hat{\mathbf{h}} := \arg\min_{\mathbf{h}}\ \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{h}\|_2^2 + \lambda\sum_{i=1}^{D} w_i|h_i| \qquad (10)$$

where $h_i$ is the $i$th entry of $\mathbf{h}$, and $w_i > 0$ for $i = 1, \ldots, D$. Two choices of $w_i$ are commonly adopted:

w1) $w_i = 1$ for $i = 1, \ldots, D$, which corresponds to the conventional Lasso estimator [25]; or

w2) $w_i = 1/|\hat{h}_{R,i}|$ for $i = 1, \ldots, D$, where $\hat{h}_{R,i}$ is the $i$th entry of the ridge estimate (9), which leads to the weighted Lasso estimator [32].
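As a concrete (hypothetical) instantiation of (10) with the weights w1)/w2), the weighted problem can be reduced to a plain Lasso by dividing column $i$ of $\mathbf{\Phi}$ by $w_i$; the sketch below does so with scikit-learn, whose Lasso objective divides the squared loss by $N$. The data, $\lambda$ value, and tolerance constants are stand-ins, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def weighted_lasso(Phi, y, lam, w):
    """Solve min_h 0.5*||y - Phi h||_2^2 + lam * sum_i w_i |h_i|
    via column rescaling: with g_i := w_i h_i this is a plain Lasso."""
    N = Phi.shape[0]
    Phi_w = Phi / w                         # divide column i by w_i
    est = Lasso(alpha=lam / N, fit_intercept=False, max_iter=10000)
    est.fit(Phi_w, y)
    return est.coef_ / w                    # map back: h_i = g_i / w_i

# w1) plain Lasso: w = all-ones.   w2) weighted Lasso: w_i = 1/|ridge_i|.
rng = np.random.default_rng(1)
Phi = rng.uniform(-1, 1, (100, 40))
h_true = np.zeros(40); h_true[[3, 17]] = 1.0
y = Phi @ h_true + 0.05 * rng.standard_normal(100)
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(Phi, y).coef_
h_wl = weighted_lasso(Phi, y, lam=1.0, w=1.0 / (np.abs(ridge) + 1e-8))
print(np.flatnonzero(np.abs(h_wl) > 1e-6))  # ideally the support {3, 17}
```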

Asymptotic performance of the Lasso estimator has been analyzed in [10], where it is shown that the weighted Lasso estimator exhibits improved asymptotic properties over Lasso, at the price of requiring the ridge regression estimates to evaluate the $w_i$'s [32]. For the practical finite-sample regime, performance of the Lasso estimator is analyzed through the RIP of $\mathbf{\Phi}$ in Section IV, where rules of thumb are also provided for the selection of $\lambda$ as well (cf. Lemma 1).

Albeit known for linear regression models, the novelty here is the adoption of (weighted) Lasso for sparse polynomial regressions. Sparse generalized linear regression models, such as $\ell_1$-regularized logistic and probit regressions, can be fit as a series of successive Lasso problems after appropriately redefining the response and weighting the input [13, Sec. 4.4.1], [27]. Hence, solving and analyzing Lasso for sparse polynomial expansions is important for generalized polynomial regression as well. Moreover, in certain applications, Volterra coefficients are collected in subsets (according to their order or other criteria) that are expected to be (non)zero as a group [24]. In such applications, using methods promoting group-sparsity is expected to improve recoverability [30]. Even though sparsity is manifested here at the single-coefficient level, extensions toward the aforementioned direction constitute an interesting future research topic.

Algorithmically, the convex optimization problem in (10) can be tackled by any generic second-order cone program (SOCP) solver, or any other method tailored to the Lasso estimator. The method of choice here is the coordinate descent scheme of [12], which is outlined next for completeness. The core idea is to iteratively minimize (10) w.r.t. one entry of $\mathbf{h}$ at a time, while keeping the remaining ones fixed, by solving the scalar minimization problem

$$\hat{h}_i = \arg\min_{h_i}\ \frac{1}{2}\left\|\left(\mathbf{y} - \mathbf{\Phi}_{-i}\hat{\mathbf{h}}_{-i}\right) - \boldsymbol{\phi}^i h_i\right\|_2^2 + \lambda w_i|h_i| \qquad (11)$$

where $\boldsymbol{\phi}^i$ is the $i$th column¹ of $\mathbf{\Phi}$, variables $\mathbf{\Phi}_{-i}$ and $\hat{\mathbf{h}}_{-i}$ denote $\mathbf{\Phi}$ and $\hat{\mathbf{h}}$, respectively, having the $i$th column (entry) removed, and $\hat{\mathbf{h}}$ is the latest value for the optimum. It turns out that the component-wise minimization in (11) admits the closed-form (soft-thresholding) solution [12]

$$\hat{h}_i = \frac{\operatorname{sgn}(z_i)\max\{|z_i| - \lambda w_i,\ 0\}}{R_{ii}} \qquad (12)$$

¹Recall that $\boldsymbol{\phi}_n^T$ stands for the $n$th row of $\mathbf{\Phi}$.


Algorithm 1: CCD-(W)L

1: Initialize $\hat{\mathbf{h}} = \mathbf{0}$.
2: Compute matrix $\mathbf{R} := \mathbf{\Phi}^T\mathbf{\Phi}$ and vector $\mathbf{z} := \mathbf{\Phi}^T\mathbf{y}$.
3: repeat
4: for $i = 1, \ldots, D$ do
5: Update $\mathbf{z}_{-i}$ via (13).
6: Update $\hat{h}_i$ using (12).
7: Update $\mathbf{z}$ via (14).
8: end for
9: until convergence of $\hat{\mathbf{h}}$.

where $z_i := r_i - \sum_{j \ne i} R_{ij}\hat{h}_j$, $R_{ij}$ is the $(i,j)$th entry of the sample correlation or Grammian matrix $\mathbf{R} := \mathbf{\Phi}^T\mathbf{\Phi}$, and $r_i$ is the $i$th entry of $\mathbf{r} := \mathbf{\Phi}^T\mathbf{y}$. After initializing $\hat{\mathbf{h}}$ to any value (usually zero), the algorithm iterates by simply updating the entries of $\hat{\mathbf{h}}$ via (12). By defining $\mathbf{z} := \mathbf{r} - \mathbf{R}\hat{\mathbf{h}}$, the vector $\mathbf{z}_{-i}$ needed in (12) can be updated as

$$\mathbf{z}_{-i} = \mathbf{z} + \mathbf{R}^i\hat{h}_i \qquad (13)$$

with $\mathbf{R}^i$ being the $i$th column of $\mathbf{R}$. After updating $\hat{h}_i$ to its new value via (12), $\mathbf{z}$ has to be updated too as

$$\mathbf{z} = \mathbf{z}_{-i} - \mathbf{R}^i\hat{h}_i \qquad (14)$$

It is easy to see that $\mathbf{z}$ and $\mathbf{z}_{-i}$ in (13), (14) are not both essentially needed, and one can update only one of them. These iterates constitute the cyclic coordinate descent (CCD) algorithm for the (weighted) Lasso problem, and are tabulated as Algorithm 1. CCD-(W)L is guaranteed to converge to a minimizer of (10) [12]. Apart from the initial computation of $\mathbf{R}$ and $\mathbf{r}$, which incurs complexity $O(ND^2)$, the complexity of Algorithm 1 as presented here is $O(D)$ per coordinate iteration; see also [12].
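A minimal sketch of Algorithm 1, assuming the soft-thresholding form of (12)-(14) given above; variable names and the sweep count are ours:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator sgn(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ccd_lasso(Phi, y, lam, w, n_sweeps=200):
    """Cyclic coordinate descent for (10): precompute the Grammian
    R = Phi^T Phi and r = Phi^T y once, then cycle through the scalar
    updates (12), maintaining the correlation vector implicitly."""
    D = Phi.shape[1]
    R = Phi.T @ Phi            # one-off O(N D^2) cost
    r = Phi.T @ y
    h = np.zeros(D)
    for _ in range(n_sweeps):
        for i in range(D):
            # correlation of column i with the residual excluding h_i
            z = r[i] - R[i] @ h + R[i, i] * h[i]
            h[i] = soft(z, lam * w[i]) / R[i, i]
    return h
```

Each inner update touches one row of $\mathbf{R}$, matching the $O(D)$ per-coordinate cost stated above; a production version would also track convergence of $\hat{\mathbf{h}}$ instead of running a fixed number of sweeps.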

B. Recursive Estimators

Unlike batch estimators, their recursive counterparts offer computational and memory savings, and enable tracking of slowly time-varying systems. The recursive LS (RLS) algorithm is an efficient implementation of the LS and ridge estimators. It solves sequentially the following problem:

$$\hat{\mathbf{h}}_n := \arg\min_{\mathbf{h}}\ \sum_{\tau=1}^{n}\beta^{n-\tau}\left(y_\tau - \boldsymbol{\phi}_\tau^T\mathbf{h}\right)^2 + \beta^n\delta\|\mathbf{h}\|_2^2 \qquad (15)$$

where $\beta \in (0, 1]$ denotes the forgetting factor and $\delta$ a small positive constant. For time-invariant systems, $\beta$ is set to 1, while $\beta < 1$ enables tracking of slow variations. Similar to the batch LS, the RLS does not exploit the a priori knowledge on the sparsity of $\mathbf{h}$, and suffers from numerical instability, especially when the effective memory of the algorithm, $1/(1 - \beta)$, is comparable to the dimension of $\mathbf{h}$.

To overcome these limitations, the following approach is advocated for polynomial regression:

$$\hat{\mathbf{h}}_n := \arg\min_{\mathbf{h}}\ \frac{1}{2}\sum_{\tau=1}^{n}\beta^{n-\tau}\left(y_\tau - \boldsymbol{\phi}_\tau^T\mathbf{h}\right)^2 + \lambda_n\sum_{i=1}^{D} w_i|h_i| \qquad (16)$$

Algorithm 2: CCD-R(W)L

1: Initialize $\hat{\mathbf{h}}_0 = \mathbf{0}$, $\mathbf{R}_0 = \delta\mathbf{I}_D$, $\mathbf{r}_0 = \mathbf{0}$.
2: for $n = 1, 2, \ldots$ do
3: Update $\mathbf{R}_n$ and $\mathbf{r}_n$ via (17a) and (17b).
4: for $i = 1, \ldots, D$ do
5: Update $\mathbf{z}_{-i}$ via (13).
6: Update $\hat{h}_i$ using (12) with $\lambda_n w_i$.
7: Update $\mathbf{z}$ via (14).
8: end for
9: end for

where the weights $w_i$ can be chosen as:

a1) $w_i = 1$, which corresponds to the recursive Lasso (RL) problem; or

a2) $w_i = 1/|\hat{h}_{R,n,i}|$, with $\hat{h}_{R,n,i}$ the $i$th entry of a ridge estimate at time $n$, leading to the recursive weighted Lasso (RWL) one.

The sequence $\{\hat{\mathbf{h}}_n\}$ cannot be updated recursively, and (16) calls for a convex optimization solver for each time instant or measurement $n$. To avoid the computational burden involved, several methods have been developed for sparse linear models; see [1] and the references therein. The coordinate descent algorithm of Section III-A can be extended to (16) by first updating $\mathbf{R}_n$ and $\mathbf{r}_n$ as

$$\mathbf{R}_n = \beta\mathbf{R}_{n-1} + \boldsymbol{\phi}_n\boldsymbol{\phi}_n^T \qquad (17a)$$

$$\mathbf{r}_n = \beta\mathbf{r}_{n-1} + y_n\boldsymbol{\phi}_n \qquad (17b)$$

Warm-started at $\hat{\mathbf{h}}_{n-1}$, the solution at time $n-1$, the minimizer of (16) can then be found by performing component-wise minimizations until convergence, in the spirit of the corresponding batch estimator. However, to speed up computations and leverage the adaptivity of the solution, we choose to perform a single cycle of component-wise updates. Thus, $\hat{\mathbf{h}}_n$ is formed by the iterates of the inner loop in Algorithm 2, where $\mathbf{z}$, $\mathbf{z}_{-i}$, $R_{ij}$, and $r_i$ are defined as before, with $\mathbf{R}_n$ and $\mathbf{r}_n$ in place of $\mathbf{R}$ and $\mathbf{r}$.

The presented algorithm, called hereafter cyclic coordinate descent for recursive (weighted) Lasso (CCD-R(W)L), is summarized as Algorithm 2; the convergence properties of CCD-RL have been established in [1] for linear regression, but carry over directly to the polynomial regression considered here. Its complexity is $O(D^2)$ per measurement, which is of the same order as the RLS. With the weights set as in a1) or a2), the CCD-R(W)L algorithms approximate the minimizers of the R(W)L problems.
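A corresponding sketch of Algorithm 2: rank-one updates (17a), (17b) followed by a single coordinate sweep per measurement. The initialization constant and the $\lambda_n$/weight schedules are placeholders, not the paper's choices.

```python
import numpy as np

def ccd_rwl_stream(phis, ys, lam_fn, w_fn, beta=1.0, delta=1e-3):
    """Online CCD-R(W)L sketch: for each new pair (phi_n, y_n), update the
    Grammian and correlation vector with forgetting factor beta, then run
    ONE cyclic coordinate sweep.  lam_fn(n) and w_fn(h) are user-supplied
    schedules for lambda_n and the weights (placeholders here)."""
    D = len(phis[0])
    R = delta * np.eye(D)          # small ridge-style initialization
    r = np.zeros(D)
    h = np.zeros(D)
    for n, (phi, y) in enumerate(zip(phis, ys), start=1):
        R = beta * R + np.outer(phi, phi)    # (17a)
        r = beta * r + y * phi               # (17b)
        lam, w = lam_fn(n), w_fn(h)
        for i in range(D):                   # single CCD cycle
            z = r[i] - R[i] @ h + R[i, i] * h[i]
            h[i] = np.sign(z) * max(abs(z) - lam * w[i], 0.0) / R[i, i]
        yield h.copy()
```

The single sweep keeps the per-measurement cost at $O(D^2)$, dominated by the rank-one Grammian update, mirroring the complexity claim above.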

C. Polynomial Reproducing Kernels

An alternative approach to polynomial modeling is via kernel regression [23]. In the general setup, kernel regression approximates a nonlinear function $f(\mathbf{x})$ assuming it can be linearly expanded over a possibly infinite number of basis functions $\{\psi_i(\mathbf{x})\}$ as $f(\mathbf{x}) = \sum_i c_i\psi_i(\mathbf{x})$. When $\psi_i(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}_i)$ with $k(\cdot, \cdot)$ denoting a judiciously selected positive definite kernel, $f$ lies in a reproducing kernel Hilbert space $\mathcal{H}$, and kernel regression is formulated as the variational problem

$$\min_{f \in \mathcal{H}}\ \sum_{n=1}^{N} c\left(y_n, f(\mathbf{x}_n)\right) + \mu\|f\|_{\mathcal{H}}^2 \qquad (18)$$


where $c(\cdot, \cdot)$ is an arbitrary cost function, and $\|f\|_{\mathcal{H}}$ is the norm in $\mathcal{H}$ that penalizes the complexity of $f$. It turns out that there exists a minimizer of (18) expressed as $\hat{f}(\mathbf{x}) = \sum_{n=1}^{N}\alpha_n k(\mathbf{x}, \mathbf{x}_n)$, while for many meaningful costs the $\alpha_n$'s can be computed in polynomial time using convex optimization solvers [23].

Polynomial regression can be cast as kernel regression after setting $k(\mathbf{x}_i, \mathbf{x}_j)$ to be either the homogeneous polynomial kernel $(\mathbf{x}_i^T\mathbf{x}_j)^P$, or one of the inhomogeneous ones, $(1 + \mathbf{x}_i^T\mathbf{x}_j)^P$ or $\sum_{p=0}^{P}(\mathbf{x}_i^T\mathbf{x}_j)^p$ [11], [23]. Once the $\alpha_n$'s have been estimated, the polynomial coefficients $\mathbf{h}$ (cf. (4)) can be found in closed form [11]. Furthermore, objectives such as the $\epsilon$-insensitive cost yield sparsity in the $\boldsymbol{\alpha}$-domain, and thus designate the so-called support vectors among the $\mathbf{x}_n$'s [23]. Even though kernel regression alleviates complexity concerns, the $\mathbf{h}$ which can be indirectly obtained cannot be sparse. Thus, sparsity-aware estimation in the primal $\mathbf{h}$-domain (as opposed to the dual $\boldsymbol{\alpha}$-domain) comes with interpretational and modeling advantages.
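For comparison, the dual route of this subsection can be exercised with an off-the-shelf kernel ridge regressor and the inhomogeneous polynomial kernel; the data and hyperparameters below are arbitrary stand-ins.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Kernel trick for polynomial regression: fit in the dual with the
# inhomogeneous kernel k(x, x') = (gamma * x^T x' + coef0)^degree, avoiding
# the explicit choose(L+P, P)-dimensional feature map.  The dual weights
# alpha_n are generally all nonzero, so the implied primal coefficient
# vector h is NOT sparse -- the limitation motivating Section III.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, (150, 8))
y = 2 * X[:, 0] - X[:, 3] * X[:, 5] + 0.01 * rng.standard_normal(150)
kr = KernelRidge(alpha=1.0, kernel="polynomial", degree=2, coef0=1, gamma=1.0)
kr.fit(X, y)
print(kr.predict(X[:3]))
```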

IV. IDENTIFIABILITY OF SPARSE POLYNOMIAL MODELS

This section focuses on specifying whether the optimization problems in (8) and (10) are capable of identifying a sparse polynomial expansion. The asymptotic-in-$N$ behavior of the (weighted) Lasso estimator has been studied in [10], [32]; practically though, one is more interested in finite-sample recoverability guarantees. One of the tools utilized to this end is the so-called restricted isometry properties (RIP) of the involved regression matrix $\mathbf{\Phi}$. These are defined as follows [6].

Definition 1 (Restricted Isometry Properties (RIP)): Matrix $\mathbf{\Phi} \in \mathbb{R}^{N \times D}$ possesses the restricted isometry constant of order $s$, denoted as $\delta_s \in (0, 1)$, if for all $\mathbf{h} \in \mathbb{R}^D$ with $\|\mathbf{h}\|_0 \le s$

$$(1 - \delta_s)\|\mathbf{h}\|_2^2 \le \|\mathbf{\Phi}\mathbf{h}\|_2^2 \le (1 + \delta_s)\|\mathbf{h}\|_2^2. \qquad (19)$$

RIP were initially derived to provide identifiability conditions for an $s$-sparse vector $\mathbf{h}$ given the noiseless linear measurements $\mathbf{y} = \mathbf{\Phi}\mathbf{h}$. It has been shown that the $\ell_0$-pseudonorm minimization in (7) can uniquely recover $\mathbf{h}$ if and only if $\delta_{2s} < 1$. If additionally $\delta_{2s} < \sqrt{2} - 1$, then $\mathbf{h}$ is the unique minimizer of the basis pursuit cost in (8) [5].

RIP-based analysis extends to noisy linear observations of an $s$-sparse vector; that is, $\mathbf{y} = \mathbf{\Phi}\mathbf{h} + \mathbf{v}$ for noise of bounded norm $\|\mathbf{v}\|_2 \le \epsilon$. If $\delta_{2s} < \sqrt{2} - 1$, the constrained version of the Lasso optimization problem

$$\hat{\mathbf{h}} := \arg\min_{\mathbf{h}}\ \|\mathbf{h}\|_1 \quad \text{s.t.}\quad \|\mathbf{y} - \mathbf{\Phi}\mathbf{h}\|_2 \le \epsilon \qquad (20)$$

yields $\|\hat{\mathbf{h}} - \mathbf{h}\|_2 \le C_0\epsilon$, where the constant $C_0$ depends only on $\delta_{2s}$ [5]. Furthermore, if $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_N)$, the Dantzig selector defined as

$$\hat{\mathbf{h}} := \arg\min_{\mathbf{h}}\ \|\mathbf{h}\|_1 \quad \text{s.t.}\quad \|\mathbf{\Phi}^T(\mathbf{y} - \mathbf{\Phi}\mathbf{h})\|_\infty \le \mu \qquad (21)$$

for $\mu := \sigma\sqrt{2\log D}$, satisfies $\|\hat{\mathbf{h}} - \mathbf{h}\|_2^2 \le C_1\,s\,\sigma^2\log D$ with high probability, whenever the involved restricted isometry constants are sufficiently small [7]. Similarly, RIP-based recoverability guarantees can be derived in the stochastic noise setting for the Lasso estimator, as described in the following lemma.

Lemma 1: Consider the linear model $\mathbf{y} = \mathbf{\Phi}\mathbf{h} + \mathbf{v}$, where the columns of $\mathbf{\Phi}$ are of unit $\ell_2$-norm, $\mathbf{h}$ is $s$-sparse, and $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_N)$. Let $\hat{\mathbf{h}}$ denote the minimizer of the Lasso estimator (10) with $w_i = 1$ for all $i$, and $\lambda$ chosen proportional to $\sigma\sqrt{\log D}$. If the restricted isometry constants of $\mathbf{\Phi}$ are sufficiently small, the bounds

(22)

(23)

(24)

hold with probability at least $1 - D^{-c}$ for some constant $c > 0$; the three bounds control the estimation and prediction errors of the Lasso estimate.

Proof: The lemma follows readily by properly adapting Lemma 4.1 and Theorem 7.2 of [4].

The earlier stated results document and quantify the role of RIP-based analysis in establishing identifiability in a compressive sampling setup. However, Definition 1 suggests that finding the RIP of a given matrix is probably a hard combinatorial problem. Thus, to derive sparse recoverability guarantees, one usually resorts to random matrix ensembles for which probabilistic RIP bounds are available [6], [22]. In the generic sparse linear regression setup, it has been shown that when the entries of $\mathbf{\Phi}$ are independently Gaussian or Bernoulli, $\mathbf{\Phi}/\sqrt{N}$ possesses RIP $\delta_s$ with high probability when the number of measurements scales at least as $N = C\,\delta_s^{-2}\,s\log(D/s)$, where $C$ is a universal constant; this bound is known to be optimal [6]. In a sparse system identification setup where the regression matrix has a Toeplitz structure, the condition on the number of measurements obtained so far loosens to a scaling of $s^2\log D$ for a Gaussian, Bernoulli, or uniform input [14], [22]. The quadratic scaling of $N$ w.r.t. $s$ in the latter bound, versus the linear scaling in the former, can be attributed to the statistical dependencies among the entries of $\mathbf{\Phi}$ [22]. Our contribution pertains to characterizing the RIP of the involved regression matrix for both the Volterra system identification and the multivariate polynomial regression scenarios.

A. RIP for Volterra System Identification

For the Volterra filtering problem under study, the following assumptions will be in force:

as1) the input $x_n$ is independently drawn from the uniform distribution, i.e., $x_n \sim \mathcal{U}[-1, 1]$; and

as2) the expansion is of order $P = 2$ (linear-quadratic Volterra model).

Regarding as1), recall that the Volterra expansion is a Taylor series approximation of a nonlinear function; thus, it is reasonable to focus on a bounded input region. Moreover, practically, one is frequently interested in the behavior of a nonlinear system for a limited input range. For as2), the nonhomogeneous quadratic Volterra model is a commonly adopted one. Generalization to models with $P > 2$ is not straightforward and goes beyond the scope of our RIP analysis. The considered Volterra filter length is $D = 1 + L + L(L+1)/2$; and, for future use, it is easy to check that under as1) it holds that $\mathbb{E}[x_n^2] = 1/3$ and $\mathbb{E}[x_n^4] = 1/5$.

To start, recall the definition of the Grammian matrix $\mathbf{R} := \mathbf{\Phi}^T\mathbf{\Phi}/N$, and let $R_{ij}$ denote its $(i,j)$th entry. As shown in [14, Sec. III], the matrix $\mathbf{\Phi}/\sqrt{N}$ possesses RIP $\delta_s$ if there exist positive $\delta_1$ and $\delta_2$ with $\delta_1 + \delta_2 \le \delta_s$ such that $|R_{ii} - 1| \le \delta_1$ for every $i$, and $|R_{ij}| \le \delta_2/s$ for every $i \ne j$. When these conditions hold, Gershgorin's disc theorem guarantees that the eigenvalues of Grammian matrices formed by any combination of $s$ columns of $\mathbf{\Phi}$ lie in the interval $[1 - \delta_s, 1 + \delta_s]$, and $\mathbf{\Phi}/\sqrt{N}$ possesses RIP $\delta_s$ by definition. In a nutshell, for a regression matrix to have small $\delta_s$'s, and hence favorable compressed sampling properties, it suffices that its Grammian matrix has diagonal entries close to unity and off-diagonal entries close to zero. If the involved regression matrix had unit $\ell_2$-norm columns, then the $R_{ii}$'s would be unity by definition, and one could merely study the quantity $\max_{i \ne j}|R_{ij}|$, defined as the coherence of $\mathbf{\Phi}$; see also [22, p. 13] for the relation between coherence and the RIP.

In the Volterra filtering problem at hand, the diagonal entries $R_{ii}$ are not all equal to one; but an appropriate normalization of the columns of $\mathbf{\Phi}$ can provide at least $\mathbb{E}[R_{ii}] = 1$ for all $i$. The law of large numbers dictates that, given sufficiently many measurements $N$, the $R_{ii}$'s will approach their mean value. Likewise, it is desirable for the off-diagonal entries of $\mathbf{R}$ to have zero mean, so that they vanish for large $N$. Such a requirement is not inherently satisfied by all $R_{ij}$'s with $i \ne j$; e.g., the inner product between columns of the form $[\cdots\ x_{n-k}^2\ \cdots]^T$ and $[\cdots\ x_{n-l}^2\ \cdots]^T$ for some $k$ and $l \ne k$ has expected value that is strictly positive.

To achieve the desired properties, namely

p1) $\mathbb{E}[R_{ii}] = 1$ for all $i$; and

p2) $\mathbb{E}[R_{ij}] = 0$ for all $i$ and $j \ne i$,

it will soon be established that instead of studying the RIP of $\mathbf{\Phi}$, one can equivalently focus on its modified version $\bar{\mathbf{\Phi}}$ defined as

$$\bar{\mathbf{\Phi}} := \left[\mathbf{1}_N\ \ \bar{\mathbf{\Phi}}_1\ \ \bar{\mathbf{\Phi}}_2\ \ \bar{\mathbf{\Phi}}_b\right] \qquad (25)$$

where $\mathbf{1}_N$ corresponds to the constant (intercept or dc) component; $\bar{\mathbf{\Phi}}_1$ and $\bar{\mathbf{\Phi}}_2$ are two Toeplitz matrices corresponding to the linear and quadratic parts, with columns built from properly scaled $x_n$'s and from scaled, mean-removed $x_n^2$'s, respectively; and $\bar{\mathbf{\Phi}}_b$ is a (non-Toeplitz) matrix related to the bilinear part, with columns built from scaled cross products $x_{n-k}x_{n-l}$, $k \ne l$.

Consider now the Grammian of $\bar{\mathbf{\Phi}}$, namely $\bar{\mathbf{R}} := \bar{\mathbf{\Phi}}^T\bar{\mathbf{\Phi}}/N$. Compared with $\mathbf{\Phi}$, the columns of $\bar{\mathbf{\Phi}}$ have their $\ell_2$-norm normalized in expectation, and thus $\bar{\mathbf{\Phi}}$ satisfies p1). Moreover, those columns of $\bar{\mathbf{\Phi}}$ corresponding to the quadratic part (cf. submatrix $\bar{\mathbf{\Phi}}_2$) are shifted by the variance of $x_n$. One can readily verify that p2) is then satisfied as well.

The transition from $\mathbf{\Phi}$ to $\bar{\mathbf{\Phi}}$ raises a legitimate question though: Does the RIP of $\bar{\mathbf{\Phi}}$ provide any insight on the compressed sampling guarantees for the original Volterra problem? In the noiseless scenario, we actually substitute the optimization problem in (8) by

$$\min_{\bar{\mathbf{h}}}\ \|\bar{\mathbf{h}}\|_1 \quad \text{s.t.}\quad \mathbf{y} = \bar{\mathbf{\Phi}}\bar{\mathbf{h}} \qquad (26)$$

Upon matching the expansions $\mathbf{\Phi}\mathbf{h} = \bar{\mathbf{\Phi}}\bar{\mathbf{h}}$, a one-to-one mapping between the entries of $\mathbf{h}$ and $\bar{\mathbf{h}}$ holds, given in (27a)-(27d): the intercept in (27a) absorbs the original constant term together with the means of the quadratic terms, while the remaining linear, quadratic, and bilinear coefficients in (27b)-(27d) are deterministically scaled versions of their counterparts, for $k = 0, \ldots, L-1$ and $l > k$.

It is now apparent that a sparse solution of (26) translates to a sparse solution of (8), except for the constant term in (27a). By deterministically adjusting the weights $w_i$ and the parameter $\lambda$ in (10), this argument carries over to the Lasso optimization problem, and answers affirmatively the previously posed question. Note though that such a modification serves only analytical purposes; practically, there is no need to solve the modified compressed sampling problems.

Remark 1: Interestingly, the transition from the original Volterra matrix to the modified one resembles the replacement of the Volterra by the Wiener polynomials for nonlinear system identification [16]. Wiener polynomials are known to facilitate mean-square error (MSE)-optimal estimation of Volterra modules for a white Gaussian input; see, e.g., [16]. Our modification, adjusted to a uniformly distributed input, facilitates the RIP analysis of the Volterra regression matrix.

One of the main results of this paper is summarized in the following theorem (see the Appendix for a proof).

Theorem 1 (RIP in Volterra Filtering): Let $\{x_n\}$ be an input sequence of independent random variables drawn from $\mathcal{U}[-1, 1]$, and define $D := 1 + L + L(L+1)/2$. Assume that the modified Volterra regression matrix $\bar{\mathbf{\Phi}}$ defined in (25) is formed by such an input for $P = 2$ and $L \ge 2$. Then, for any $\delta_s \in (0, 1)$ and for any $\varepsilon \in (0, 1)$, whenever $N \ge C\,\delta_s^{-2}\,s^2\log(D/\varepsilon)$, the matrix $\bar{\mathbf{\Phi}}/\sqrt{N}$ possesses RIP $\delta_s$ for sparsity level $s$ with probability exceeding $1 - \varepsilon$, where $C$ is a universal constant.

The theorem asserts that $N = O(s^2\log D)$ observations suffice to recover an $s$-sparse nonhomogeneous second-order Volterra filter of memory $L$ probed by a uniformly distributed input. Since the number of unknowns $D$ is $O(L^2)$, the bound on $N$ scales also as $O(s^2\log L)$. The bound agrees with the bounds obtained for the linear filtering setup [14], whereas now the constants are larger due to the more involved dependencies among the entries of the associated regression matrix.

B. RIP for Multivariate Polynomial Regression

Consider now the case where $y_n$ describes a sparse linear-quadratic model

$$y_n = h_0 + \sum_{i=1}^{L} h_1(i)\,x_{n,i} + \sum_{i=1}^{L}\sum_{j=i}^{L} h_2(i, j)\,x_{n,i}x_{n,j} + v_n \qquad (28)$$

Given $N$ output samples $\{y_n\}$ corresponding to input data $\{\mathbf{x}_n\}$ drawn independently from $\mathcal{U}[-1, 1]^L$, the goal is to recover the sparse vector $\mathbf{h}$ comprising the $h_1(i)$'s and $h_2(i, j)$'s. Note that $D = 1 + L + L(L+1)/2$ here. As explained in Section II, the noiseless expansion in (28) can be written as $\mathbf{y} = \mathbf{\Phi}\mathbf{h}$; but, contrary to the Volterra filtering setup, the rows of $\mathbf{\Phi}$ are now statistically independent. The last observation differentiates significantly the RIP analysis for polynomial regression, and leads to tighter probabilistic bounds.

Our analysis builds on [22], which deals with finding a sparse expansion of a function $f(\mathbf{x}) = \sum_{i=1}^{D} h_i\psi_i(\mathbf{x})$ over a bounded orthonormal set of functions $\{\psi_i\}_{i=1}^{D}$. Considering a measurable space, e.g., a measurable subset $\mathcal{D}$ of $\mathbb{R}^L$ endowed with a probability measure $\nu$, the set of functions $\{\psi_i\}_{i=1}^{D}$ is a bounded orthonormal system if for all $i, j$

$$\int_{\mathcal{D}} \psi_i(\mathbf{x})\psi_j(\mathbf{x})\,d\nu(\mathbf{x}) = \delta_{ij} \qquad (29)$$

where $\delta_{ij}$ denotes the Kronecker delta function, and for some constant $K \ge 1$ it holds that

$$\sup_{\mathbf{x} \in \mathcal{D}}\ |\psi_i(\mathbf{x})| \le K \quad \text{for all } i. \qquad (30)$$

After sampling $f(\mathbf{x})$ at $\{\mathbf{x}_n\}_{n=1}^{N}$, the involved regression matrix $\mathbf{\Psi}$ with entries $\Psi_{ni} := \psi_i(\mathbf{x}_n)$ admits the following RIP characterization [22, Theorems 4.4 and 8.4].

Theorem 2 (RIP in Bounded Orthonormal Systems [22]): Let $\mathbf{\Psi} \in \mathbb{R}^{N \times D}$ be the matrix associated with a bounded orthonormal system with constant $K$ in (30). Then, for any $\delta_s \in (0, 1)$, there exist universal positive constants $C$ and $c$, such that whenever $N \ge C\,\delta_s^{-2}K^2\,s\log^3(s)\log(D)$, the matrix $\mathbf{\Psi}/\sqrt{N}$ possesses RIP $\delta_s$ with probability exceeding $1 - D^{-c\log^3 s}$.

In the linear-quadratic regression of (28), even though the basis functions $\{1, x_i, x_ix_j\}$ are bounded in $[-1, 1]$, they are not orthonormal in the uniform probability measure. Fortunately, our input transformation trick devised for the Volterra filtering problem applies to the polynomial regression as well. The expansion is now over the basis functions

$$\{1\} \cup \left\{\sqrt{3}\,x_i\right\} \cup \left\{\tfrac{3\sqrt{5}}{2}\left(x_i^2 - \tfrac{1}{3}\right)\right\} \cup \left\{3\,x_ix_j,\ i < j\right\} \qquad (31)$$

where the last subset contains all the unique two-variable monomials, lexicographically ordered. Upon stacking the function values in $\mathbf{\Psi}$ and properly defining $\bar{\mathbf{h}}$, the expansion can be replaced by $\mathbf{y} = \mathbf{\Psi}\bar{\mathbf{h}}$, where the entries of $\mathbf{\Psi}$ are

$$\Psi_{ni} := \psi_i(\mathbf{x}_n). \qquad (32)$$

Vectors $\mathbf{h}$ and $\bar{\mathbf{h}}$ are related through the one-to-one mapping in (27); thus, sparsity in one is directly translated to the other. Identifiability of a sparse $\bar{\mathbf{h}}$ can be guaranteed by the RIP analysis of $\mathbf{\Psi}$ presented in the next lemma.

Lemma 2 (RIP in Linear-Quadratic Regression): Let $x_{n,i}$ for $n = 1, \ldots, N$ and $i = 1, \ldots, L$ be independent random variables uniformly distributed in $[-1, 1]$, and define $D := 1 + L + L(L+1)/2$. Assume that the modified polynomial regression matrix $\mathbf{\Psi}$ in (32) is generated by this sequence for $L \ge 2$. Then, for any $\delta_s \in (0, 1)$, there exist universal positive constants $C$ and $c$, such that whenever $N \ge C\,\delta_s^{-2}\,s\log^3(s)\log(D)$, the matrix $\mathbf{\Psi}/\sqrt{N}$ possesses RIP $\delta_s$ with probability exceeding $1 - D^{-c\log^3 s}$.

Proof: The inputs are uniformly drawn over $[-1, 1]^L$, and it is easy to verify that the basis functions in (31) satisfy the orthonormality in (29). Since $\sup_{\mathbf{x}}|\psi_i(\mathbf{x})| \le 3$ for all $i$, it follows that $K = 3$ in (30). Hence, Theorem 2 can be straightforwardly applied.
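The orthonormality claims (29), (30) for the Legendre-type normalization in (31) can be checked by simulation; the constants below are the standard normalizations under $x \sim \mathcal{U}[-1, 1]$, and the ordering of terms is ours.

```python
import numpy as np

# Monte Carlo check that the normalized linear-quadratic basis is
# orthonormal under x ~ U[-1,1]^3: the sample Gram matrix should
# approach the identity, and max|psi| = 3 (so K = 3 in (30)).
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (200000, 3))

def basis(X):
    x1, x2, x3 = X.T
    return np.column_stack([
        np.ones(len(X)),
        np.sqrt(3) * x1, np.sqrt(3) * x2, np.sqrt(3) * x3,
        np.sqrt(5) * (3 * x1**2 - 1) / 2,      # Legendre P2, unit norm
        np.sqrt(5) * (3 * x2**2 - 1) / 2,
        np.sqrt(5) * (3 * x3**2 - 1) / 2,
        3 * x1 * x2, 3 * x1 * x3, 3 * x2 * x3,  # cross terms
    ])

Psi = basis(X)
G = Psi.T @ Psi / len(X)
print(np.round(G, 2))          # ~ 10 x 10 identity
print(np.abs(Psi).max())       # ~ 3
```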

Lemma 2 assures that an $s$-sparse linear-quadratic $L$-variate expansion with independent uniformly distributed inputs can be identified with high probability from a minimum number of observations that scales as $s\log^3(s)\log(D)$ or, since $D = O(L^2)$, as $s\log^3(s)\log(L)$. Comparing this to Theorem 1, the bound here scales linearly with $s$. Moreover, except for the increase in the power of the logarithmic factor, the bound is close to the one obtained for random Gaussian and Bernoulli matrices. The improvement over the Volterra RIP bound is explained by the simpler structural dependence of the matrix involved.

Another interesting polynomial regression paradigm is when the nonlinear function $f(\mathbf{x})$ admits a sparse polynomial expansion involving the $L$ inputs, and all products of up to $P$ of these inputs, that is

$$y_n = h_0 + \sum_{t=1}^{P}\ \sum_{1 \le i_1 < \cdots < i_t \le L} h_t(i_1, \ldots, i_t)\,x_{n,i_1}\cdots x_{n,i_t} + v_n \qquad (33)$$


This is the typical multilinear regression setup appearing in GWA studies [9], [27]. Because there are $\binom{L}{t}$ monomials of order $t$, the vector comprising all the expansion coefficients has dimension

$$D := \sum_{t=0}^{P}\binom{L}{t} \le \left(\frac{eL}{P}\right)^P \qquad (34)$$

where the last inequality provides a rough upper bound. The goal is again to recover an $s$-sparse $\mathbf{h}$ given the $N$ sample phenotypes $\{y_n\}$ over the genotype values $\{\mathbf{x}_n\}$. Vectors $\mathbf{x}_n$ are drawn either from $\{-1, 0, 1\}^L$ or from $\{-1, 1\}^L$, depending on the assumed genotype model (additive for the first alphabet; and dominant or recessive for the latter) [27]. Without loss of generality, consider the ternary alphabet with equal probabilities. Further, suppose for analytical convenience that the entries of $\mathbf{x}_n$ are independent. Note that the input has mean zero and variance 2/3.

The RIP analysis for the model in (33) exploits again Theorem 2. Since now every single input appears only linearly in (33), the basis functions $\{x_{i_1}\cdots x_{i_t}\}$ are orthogonal w.r.t. the assumed point mass function. A bounded orthonormal system can be constructed after scaling as

$$\psi(\mathbf{x}) = \left(\frac{3}{2}\right)^{t/2} x_{i_1}\cdots x_{i_t} \qquad (35)$$

while the set is bounded by $K = (3/2)^{P/2}$. Similar to the linear-quadratic case in (28), the original multilinear expansion $\mathbf{y} = \mathbf{\Phi}\mathbf{h}$ is transformed to $\mathbf{y} = \mathbf{\Psi}\bar{\mathbf{h}}$, where $\mathbf{\Psi}$ is defined as in (32) with the new basis of (35), and $\bar{\mathbf{h}}$ is an entry-wise rescaled version of $\mathbf{h}$. Based on these facts, the RIP characterization of $\mathbf{\Psi}$ follows readily from the ensuing lemma.²

Lemma 3 (RIP in Multilinear Expansion): Let $x_{n,i}$ for $n = 1, \ldots, N$ and $i = 1, \ldots, L$ be independent random variables equiprobably drawn from $\{-1, 0, 1\}$, and let $D$ be defined as in (34). The modified multilinear regression matrix $\mathbf{\Psi}$ in (32) and (35) is generated by this sequence. Then, for any $\delta_s \in (0, 1)$, there exist universal positive constants $C$ and $c$, such that whenever $N \ge C\,\delta_s^{-2}\,(3/2)^P\,s\log^3(s)\log(D)$, the matrix $\mathbf{\Psi}/\sqrt{N}$ possesses RIP $\delta_s$ with probability exceeding $1 - D^{-c\log^3 s}$.

Since $P$ is often chosen in the order of 2 due to computational limitations, Lemma 3 guarantees the RIP to hold with high probability when the number of phenotype samples scales at least as $s\log^3(s)\log(D)$.

V. SIMULATED TESTS

The RIP analysis performed in the previous section provides probabilistic bounds on the identifiability of sparse polynomial representations. In this section, we evaluate the applicability of sparsity-aware polynomial estimators using synthetic and real data. The experimental results indicate that sparsity-promoting recovery methods attain accurate results even when the number of measurements is less than the RIP-derived bounds, and, in any case, they outperform the sparsity-agnostic estimators.

²After our conference precursor [15], we became aware of a recent result in [18], which relates to Lemma 3. The differences are: i) only the $P$th-order term in expansion (33) is considered in [18]; and ii) the inputs adhere to the binary $\{\pm 1\}$ alphabet in [18], as opposed to the ternary one in Lemma 3.

Fig. 1. MSE of (a) batch and (b) adaptive Volterra estimators.

A. Batch and Adaptive Volterra Filters

We first focus on the sparse Volterra system identification setup. The system under study was an LNL one, consisting of a linear filter with impulse response $\{a_l\}$, in cascade with a memoryless polynomial nonlinearity $g(\cdot)$, and the same linear filter. This system is exactly described by a Volterra expansion with $P = 3$ and $L = 11$, leading to a total of $D = 364$ coefficients collected in the vector $\mathbf{h}$. Out of the 364 coefficients, only 48 are nonzero. The system input was modeled as $x_n \sim \mathcal{U}[-1, 1]$, while the output was corrupted by additive Gaussian noise. First, the batch estimators of Section III-A were tested, followed by their sequential counterparts.
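A hedged re-creation of this test setup (the paper's exact filter taps, nonlinearity coefficients, and noise variance did not survive extraction, so stand-ins are used):

```python
import numpy as np

# LNL system: linear filter -> cubic memoryless nonlinearity -> same filter.
# Such a cascade is exactly a P = 3 Volterra model of memory 2*len(a) - 1.
rng = np.random.default_rng(5)
a = np.array([1.0, 0.0, -0.5, 0.0, 0.2, 0.0])    # illustrative sparse taps
g = lambda u: u + 0.3 * u**2 - 0.1 * u**3        # illustrative nonlinearity

def lnl(x):
    u = np.convolve(x, a)[:len(x)]               # first linear filter
    return np.convolve(g(u), a)[:len(x)]         # nonlinearity + second filter

N = 1500
x = rng.uniform(-1, 1, N)
y = lnl(x) + 0.1 * rng.standard_normal(N)        # noisy output
# y can now be fed to the batch estimators of Section III-A over the
# choose(L+P, P) nonredundant Volterra coefficients.
```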

In Fig. 1(a), the obtained MSE, $\|\hat{\mathbf{h}} - \mathbf{h}\|_2^2$, averaged over 100 Monte Carlo runs, is plotted against the number of observations $N$ for the following estimators: i) the ridge estimator of (9); ii) the Lasso (CCD-L) estimator; and iii) the weighted Lasso (CCD-WL) estimator. The scaling rules for the two $\lambda$'s follow the results of [1] and [32]. It can be seen that the sparsity-agnostic ridge estimator is outperformed by the Lasso estimator for observation intervals shorter than 600. For larger $N$, where $\mathbf{\Phi}^T\mathbf{\Phi}$ becomes well-conditioned, the former provides improved estimation accuracy. However, CCD-WL offers the lowest MSE for every $N$, and provides reasonably accurate estimates even for the underdetermined case $N < D$.

TABLE I: EXPERIMENTAL RESULTS FOR SYNTHETIC QTL DATA

Performance of the sequential estimators of Section III-B was assessed in the same setup. Fig. 1(b) illustrates the MSE convergence, averaged over 100 Monte Carlo runs, for the following three recursive algorithms: i) the conventional RLS of (15); ii) the cyclic coordinate descent recursive Lasso (CCD-RL); and iii) its weighted version (CCD-RWL). Since the system was time-invariant, the forgetting factor was set to $\beta = 1$. It can be observed that the conclusions drawn for the batch case carry over to the recursive algorithms too. Moreover, a comparison of Fig. 1(a) and (b) indicates that the sparsity-aware iterates of Algorithm 2 closely approximate the exact per-time-instant problem in (16).

B. Multilinear Regression for GWA Analysis

Here, we test sparse polynomial modeling for studying epistatic effects in quantitative trait analysis. In quantitative genetics, the phenotype is a quantitative trait of an organism, e.g., the weight or height of barley seeds [26]. Ignoring environmental effects, the phenotype is assumed to follow a linear regression model over the individual's genotype, including single-gene (main) and gene-gene (epistatic) effects [9], [28]. The genotype consists of markers, which are samples of chromosomes usually taking binary $\{\pm 1\}$ values. Determining the so-called quantitative trait loci (QTL) corresponds to detecting the genes and pairs of genes associated with a particular trait [28]. Since the studied population is much smaller than the number of regressors, and postulating that only a few genotype effects determine the trait considered, QTL analysis falls under the sparse multilinear (for $P = 2$) model of (33).

1) Synthetic Data: The first QTL paradigm is a synthetic study detailed in [28]. A population of 600 individuals is simulated for a chromosome of 1800 cM (centiMorgan), evenly sampled every 15 cM to yield 121 markers. The true population mean and variance are 5.0 and 10.0, respectively. The phenotype is assumed to be linearly expressed over the intercept, the 121 main effects, and the 7260 epistatic effects, leading to a total of 7382 regressors. The QTLs simulated are 9 single markers and 13 marker pairs. Note that the simulation accommodates markers i) with main only, ii) epistatic only, and iii) both main and epistatic effects. Since the intercept is not regularized, genotype and phenotype data were centered, i.e., their sample mean was subtracted, and the intercept was determined at the end as the sample mean of the initial I/O data on the fitted model.

Parameters $\delta$ and $\lambda$ for the ridge and (w)Lasso estimators, respectively, were tuned through tenfold cross-validation over a 100-point grid [13]; see Table I. The figure of merit for selecting the parameters was the prediction error (PE) over the unseen data, i.e., the average squared error between the held-out phenotypes and the predictions formed with the regression vector estimated given all but the validation data. The value of $\lambda$ attaining the smallest PE was subsequently used for determining the weights for the wLasso estimator. Having tuned the regularization parameters, the MSE provided by the three methods was averaged over 100 Monte Carlo runs on different phenotypic data while keeping the genotypes fixed. The (w)Lasso estimators were run using the glmnet software [12]. Each of the three algorithms took less than 1 min and 1 s for cross-validation and final estimation, respectively.

Fig. 2. Regression vector estimates for the synthetic gene data. The main (epistatic) effects are shown on the diagonal (left diagonal part), while red (green) bars correspond to positive (negative) entries. (a) True model; (b) ridge regression; (c) Lasso; and (d) wLasso.

TABLE II: EXPERIMENTAL RESULTS FOR REAL QTL BARLEY DATA
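A sketch of this tuning protocol with scikit-learn's LassoCV standing in for glmnet; the genotype/phenotype data below are synthetic stand-ins, not the QTL set, and the in-sample error is only a proxy for the held-out PE described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Tenfold cross-validation of lambda over a 100-point grid, as in the text.
# X would hold the centered main + pairwise-epistatic genotype regressors;
# y the centered phenotypes.
rng = np.random.default_rng(6)
X = rng.choice([-1.0, 1.0], size=(600, 500))
h = np.zeros(500)
h[rng.choice(500, 10, replace=False)] = rng.standard_normal(10)
y = X @ h + rng.standard_normal(600)
X -= X.mean(axis=0)                    # center genotypes
y -= y.mean()                          # center phenotypes
cv = LassoCV(cv=10, n_alphas=100, fit_intercept=False).fit(X, y)
pe = np.mean((y - X @ cv.coef_) ** 2)  # in-sample proxy for the PE
print(cv.alpha_, np.count_nonzero(cv.coef_), pe)
```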

As can be seen from Table I, Lasso attains the smallest PE. However, wLasso provides significantly higher estimation accuracy at a PE value comparable to Lasso's. The number of nonzero regression coefficients indicated in the fourth column shows that ridge regression yields an oversaturated model. As shown more clearly in Fig. 2, where the true and the estimated models are plotted, wLasso yields a sparser model, closer to the true one, while avoiding some of the spurious coefficients found by Lasso.

2) Real Data From a Barley Experiment: The second QTL experiment entails a real dataset collected by the North American Barley Genome Mapping Project, as described in [26], [29] and outlined next. Aiming at a GWA analysis of barley height (HGT), the population consists of doubled-haploid lines of a cross between the two barley lines Harrington and TR306. The height of each individual was measured under 27 different environments, and the phenotype was taken to be the sample average. There are 127 markers covering a 1270 cM segment of the genome with an average marker interval of 10.5 cM. The genotype is binary: +1 (-1) for the TR306 (Harrington) allele. About 5% of the genotype values are missing; they are modeled as zeros in order to minimize their effect [28]. The main and epistatic QTL analysis thus involves 8129 regressors: the intercept, the 127 main effects, and the 8001 epistatic effects.

Fig. 3. Regression vector estimates for the real QTL barley data. The main (epistatic) effects are shown on the diagonal (left diagonal part), while red (green) bars correspond to positive (negative) entries. (a) Lasso and (b) wLasso.

TABLE III
QTLS ESTIMATED BY WLASSO FOR THE REAL BARLEY DATA
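To illustrate where the regressor count comes from, a minimal sketch of assembling the intercept, main-effect, and pairwise epistatic columns from a marker matrix follows; the marker coding and zero-imputation mirror the description above, while the population size and names are illustrative.

```python
# Build the intercept + main-effect + pairwise-epistatic design matrix from
# an n x p marker matrix coded +1/-1, with missing entries imputed as 0.
import numpy as np
from itertools import combinations

def epistatic_design(G):
    """Intercept + main effects + all pairwise (epistatic) marker products."""
    n, p = G.shape
    pairs = list(combinations(range(p), 2))                 # p(p-1)/2 marker pairs
    epi = np.column_stack([G[:, i] * G[:, j] for i, j in pairs])
    return np.hstack([np.ones((n, 1)), G, epi]), pairs

rng = np.random.default_rng(1)
G = rng.choice([-1.0, 1.0], size=(150, 127))   # toy genotypes; n is arbitrary here
# In the real data, missing genotypes (about 5%) would first be set to 0:
# G[np.isnan(G)] = 0.0
Phi, pairs = epistatic_design(G)
print(Phi.shape)   # (150, 8129): 1 + 127 + 127*126/2 regressors
```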

The regularization parameter values were selected through leave-one-out cross-validation [13]; see Table II. The ridge estimator fails to handle overfitting: cross-validation drives its regularization parameter to a large value, yielding regression coefficients of insignificant amplitude. Using the ridge estimates to weight the regression coefficients, wLasso yields a PE slightly smaller than the one attained by Lasso, while reducing the spurious coefficients. As shown in Fig. 3, wLasso provides a more parsimonious model with fewer spurious peaks than the Lasso-inferred model. Closer investigation of the largest-magnitude wLasso QTLs, listed in Table III, offers the following interesting observations: i) epistatic effects are not negligible; ii) there are epistatic effects related to QTLs with main effects, e.g., the (35,99) pair is related to marker (101); and iii) there are epistatic effects, such as the (9,33) one, involving markers with no main effect.

VI. CONCLUSION

The idea of exploiting sparsity in the representation of a system, already widely adopted for linear regression and system identification, was extended here to estimate sparse Volterra and polynomial models. The abundance of applications allowing for an interpretative parsimonious polynomial expansion, together with the inability of kernel regression to yield such an expansion, necessitates sparsity-aware polynomial estimators. This need was successfully met here from both practical and analytical perspectives. Algorithmically, the problem was solved via the batch (weighted) Lasso estimators, where for the weighted one the weights were efficiently found through the kernel trick. To further reduce the computational and memory load and enable tracking, an adaptive sparse RLS-type algorithm was devised. On the analytical side, RIP analysis was carried out for the two models. It was shown that a sparse linear-quadratic Volterra filter can be recovered with high probability using a number of measurements that scales with the sparsity level; a bound that interestingly generalizes the results for the linear filtering problem to the Volterra one. For the sparse polynomial expansions considered, the bound was further improved, again generalizing the corresponding linear regression results. The potential of the aforementioned sparse estimation methods was numerically verified through synthetic and real data. The developed sparse adaptive algorithms converged fast to the exact solution, while the (weighted) Lasso estimators outperformed the LS-based one in all simulated scenarios, as well as in the GWA study on real barley data. Future research directions include extending the bounds derived to higher-order models, and utilizing our adaptive methods to accomplish epistatic GWA studies on the considerably higher dimensional human genome.

APPENDIX

Some tools regarding concentration inequalities are outlined before proceeding to the proof of Theorem 1.

Lemma 4 (Hoeffding's Inequality): Given $t > 0$ and independent random variables $\{X_i\}_{i=1}^{N}$ bounded as $a_i \le X_i \le b_i$ almost surely, the sum $S := \sum_{i=1}^{N} X_i$ satisfies

$$\Pr\left[\,\left|S - \mathbb{E}[S]\right| \ge t\,\right] \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^{N}(b_i - a_i)^2}\right). \quad (36)$$
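As a quick sanity check of (36), the following minimal sketch compares the empirical tail of a sum of independent Rademacher variables, for which $a_i = -1$ and $b_i = 1$, against the Hoeffding bound; all parameters are illustrative.

```python
# Monte Carlo check of Hoeffding's inequality (36) for Rademacher summands,
# for which E[S] = 0 and sum_i (b_i - a_i)^2 = 4N.
import numpy as np

rng = np.random.default_rng(0)
N, trials, t = 200, 100_000, 30.0
S = rng.choice([-1.0, 1.0], size=(trials, N)).sum(axis=1)
empirical = np.mean(np.abs(S) >= t)
bound = 2.0 * np.exp(-2.0 * t**2 / (4.0 * N))
print(empirical, "<=", bound)   # empirical tail should not exceed the bound
```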

Lemma 4 is essentially a Chernoff-type result on the concentration of a sum of independent bounded random variables around its mean. However, the subsequent analysis of the RIP of the Volterra filter considers sums of structurally dependent random variables. Useful probability bounds on such sums can be derived based on the following lemma.

Lemma 5 (Hoeffding's Inequality With Dependent Summands [21]): Consider random variables $\{X_i\}_{i=1}^{N}$ bounded as $a \le X_i \le b$ almost surely. Assume also that they can be partitioned into $K$ collectively exhaustive and mutually exclusive subsets $\{\mathcal{I}_k\}_{k=1}^{K}$, with respective cardinalities $\{N_k\}_{k=1}^{K}$, such that the variables within each subset are independent. Then, for any $t > 0$, the sum $S := \sum_{i=1}^{N} X_i$ satisfies

$$\Pr\left[\,\left|S - \mathbb{E}[S]\right| \ge Nt\,\right] \le 2K \exp\left(-\frac{2 N_{\min}\, t^2}{(b-a)^2}\right)$$

where $N_{\min} := \min_k N_k$.

Note that the sharpness of the bound in Lemma 5 depends on the number of subsets $K$ as well as on the minimum of their cardinalities $N_{\min}$. One should not only strive for the minimum number of intra-independent subsets, but also arrange the $N_k$'s as uniformly as possible. For example, partitioning with the minimum number of subsets may yield $N_{\min} = 1$, which corresponds to a loose bound.
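As a concrete instance of such a partition, consider the sliding products $X_i = x_i x_{i+1}$ of an i.i.d. $\pm 1$ input: the even-indexed and odd-indexed summands form $K = 2$ internally independent subsets of nearly equal size. A minimal numerical sketch (parameters illustrative) follows.

```python
# Even/odd partition of the dependent sliding products X_i = x_i * x_{i+1}:
# within each subset no input sample is shared, so the summands are independent
# and Lemma 5 applies with K = 2 and N_min = floor((m - 1) / 2).
import numpy as np

rng = np.random.default_rng(0)
m, trials = 201, 100_000
x = rng.choice([-1.0, 1.0], size=(trials, m))
prods = x[:, :-1] * x[:, 1:]              # N = m - 1 dependent summands in {-1, 1}
S = prods.sum(axis=1)                      # full (dependent) sum, E[S] = 0

N, K, t = m - 1, 2, 0.2                    # deviation threshold is N * t
N_min = N // 2
bound = 2 * K * np.exp(-2 * N_min * t**2 / 4.0)   # (b - a)^2 = 4
print(np.mean(np.abs(S) >= N * t), "<=", bound)
```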

The partitioning required in Lemma 5 is not always easy to construct. An interesting way to handle this construction is offered by graph theory, as suggested in [21]. The link between the structural dependencies in a set of random variables $\{X_i\}_{i=1}^{N}$ and graph theory hinges on their dependency graph $G$. The latter is defined as the graph having one vertex per $X_i$, and an edge between every pair of vertices corresponding to dependent $X_i$'s.


Recall that the degree of a vertex is the number of edges attached to it, and the degree $\Delta$ of a graph is the maximum of the vertex degrees. Finding group-wise statistical independence among random variables can then be seen as a coloring of the dependency graph. The problem of coloring aims at assigning every vertex of a graph to a color (class) such that no adjacent vertices share the same color. Moreover, the coloring of a graph is equitable if the cardinality of every color class differs by no more than one from the cardinalities of all other classes. Thus, an $\ell$-equitable coloring of the dependency graph means that the random variables can be partitioned into $\ell$ intra-independent subsets whose cardinalities are either $\lfloor N/\ell \rfloor$ or $\lceil N/\ell \rceil$. A key theorem by Hajnal and Szemeredi guarantees that a graph has an $\ell$-equitable coloring for all $\ell \ge \Delta + 1$; see, e.g., [21]. Combining this result with Lemma 5 yields the following corollary.

Corollary 1 (Hoeffding's Inequality and Dependency Graph [14], [21]): Consider random variables $\{X_i\}_{i=1}^{N}$ bounded as $a \le X_i \le b$ almost surely, and assume that their dependency graph has degree $\Delta$. Then, for every integer $\ell \ge \Delta + 1$ and $t > 0$, the sum $S := \sum_{i=1}^{N} X_i$ satisfies

$$\Pr\left[\,\left|S - \mathbb{E}[S]\right| \ge Nt\,\right] \le 2\ell \exp\left(-\frac{2 \lfloor N/\ell \rfloor\, t^2}{(b-a)^2}\right)$$

which follows upon applying Lemma 5 with $K = \ell$ and $N_{\min} = \lfloor N/\ell \rfloor$.
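To get a feel for the numbers involved, the following worked instance of Corollary 1 uses illustrative values that are not taken from the paper:

```latex
% Illustrative instance of Corollary 1: N = 1000 summands with a = -1, b = 1,
% dependency-graph degree Delta = 2; choose ell = Delta + 1 = 3 and t = 0.2.
\Pr\big[\,|S - \mathbb{E}[S]| \ge 200\,\big]
  \le 2 \cdot 3 \exp\!\left(-\frac{2\,\lfloor 1000/3 \rfloor\,(0.2)^2}{(1-(-1))^2}\right)
  = 6\, e^{-6.66} \approx 7.7 \times 10^{-3}.
```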

Having presented the necessary tools, we now proceed with the proof of Theorem 1.

Proof of Theorem 1: Consider a specific realization of $\mathbf{\Phi}$ and its Grammian $\mathbf{G} := \mathbf{\Phi}^T \mathbf{\Phi}$. As guaranteed by the Gersgorin disc theorem, if $|G_{ii} - 1| \le \delta/2$ for every $i$, while $|G_{ij}| \le \delta/(2s)$ for every $(i,j)$ with $i \ne j$, for some $\delta \in (0,1)$, then $\mathbf{\Phi}$ possesses RIP $\delta_s \le \delta$ [14]. Thus, the probability of $\mathbf{\Phi}$ not satisfying the RIP of value $\delta_s$, denoted $P_f$, can be upper bounded by the probability that at least one of these conditions is violated. Apparently, the events involved are not independent. Since $\mathbf{G}$ is symmetric, the union bound can be applied over only its lower triangular part; thus, $P_f$ is upper bounded by

$$P_f \le \sum_{i} \Pr\left[\left|G_{ii} - 1\right| > \frac{\delta}{2}\right] + \sum_{i > j} \Pr\left[\left|G_{ij}\right| > \frac{\delta}{2s}\right]. \quad (37)$$

Our next goal is to upper bound the probabilities appearing in (37). Different from the analysis in [14] for the linear case, the entries of $\mathbf{G}$ exhibit different statistical properties depending on the components (constant, linear, quadratic, bilinear) of the nonlinear system they correspond to. To signify the difference, we will adopt the notation $G_{ij}^{(\alpha,\beta)}$ instead of $G_{ij}$, where $\alpha$ and $\beta$ can each be any of the four component labels, to indicate that the entry is the inner product between the $i$th and the $j$th columns of $\mathbf{\Phi}$, with the $i$th ($j$th) column coming from the $\alpha$ ($\beta$) part of the system. For example, a constant-linear entry is the inner product of the constant column of $\mathbf{\Phi}$ with a column of its linear part. Recall also that the input samples satisfy the boundedness and moment requirements stated in Theorem 1.

We start with the diagonal entries of the linear part, each of which can be expressed as a normalized sum of squared input samples. Upon recognizing this quantity as a sum of independent random variables confined in a bounded interval, Hoeffding's lemma can be readily applied. The bound obtained is multiplied by the number of such entries to account for all of them; hence

(38)

Similarly, each one of the diagonal entries of the quadratic part is a sum of independent bounded random variables. Lemma 4 yields

(39)

Before proceeding with the bilinear diagonal entries, let us consider first the off-diagonal entries of the linear part. Each one of them is a correlation-type sum of products of two input samples. The summands, however, are not generally independent; every summand is a two-variable monomial, and a single input sample may appear in two summands. This was handled in [14] after proving that such a sum can always be split into two partial sums, each including only independent terms. As a clarifying example, the sum $x_1 x_2 + x_2 x_3 + x_3 x_4 + x_4 x_5$ can be expressed as $(x_1 x_2 + x_3 x_4) + (x_2 x_3 + x_4 x_5)$, where no input sample appears twice within either parenthesis. In general, the two partial sums contain roughly half of the summands each. Applying Lemma 5 with $K = 2$, it follows that

(40)

Summing the resulting bounds over all such off-diagonal terms, their collective probability bound is

(41)

Returning to the bilinear diagonal entries, every one of them can be written as a sum of two-variable monomials. Even though the summands are not independent, they exhibit the identical structural dependence observed in the off-diagonal entries of the linear part; thus, the same splitting trick can be applied here too. Upon using Lemma 5 with $K = 2$, and adding the contributions of all bilinear diagonal entries, we end up with

(42)

Regarding the entries that couple the constant column with the linear and the quadratic columns, an immediate application of Hoeffding's inequality yields

(43)

(44)


whereas the remaining probabilities involving the constant column have already been accounted for in the preceding analysis.

The entries coupling the linear with the quadratic part are sums of bounded monomials involving at most two distinct input samples. Two sub-cases will be considered. The first corresponds to the entries in which the linear index coincides with one of the quadratic indices, so that every summand depends on a single input sample; through Lemma 4, the sum of the probabilities related to these entries can be bounded directly. The second case includes the remaining entries, for which the splitting trick can be applied to yield a Lemma 5-based bound. Combining the two bounds yields

(45)

The entries coupling the linear with the bilinear part can be expressed as sums of bounded monomials sharing the structural dependence encountered earlier. Exploiting the same splitting trick and summing up the contributions of all such entries yields

(46)

The entries coupling the quadratic with the bilinear part are sums of bounded monomials involving up to three distinct input samples. Note that some of these entries have summands that are two-input monomials, namely whenever the quadratic index coincides with one of the bilinear indices. However, to simplify the presentation, the derived bound is slightly loosened by considering all such entries as sums of three-input monomials. This specific structure precludes the application of the splitting procedure into two halves, and necessitates the use of the dependency graph. It can be shown that the degree of the dependency graph associated with the three-variable products of any entry is at most 6. Then, application of Corollary 1 over these entries, together with the inequality $\lfloor x \rfloor \ge x/2$, which holds for $x \ge 1$, yields

(47)

The entries coupling two bilinear columns that share a common index are likewise sums of bounded three-input monomials. Following a reasoning similar to the previous case,

(48)

Finally, any entry coupling two bilinear columns with no common index is expressed as a sum of four-input monomials with bounded summands; thus, the degree of the associated dependency graph is at most 12. Upon applying Corollary 1 over these entries, and using the same floor-function inequality, we obtain

(49)

Adding together the bounds for the diagonal elements (38), (39), and (42) implies

(50)

while for the off-diagonal elements, upon adding (41) and (43)–(49), it follows that

(51)

By properly choosing the free threshold parameter, the arguments of the exponentials in (50) and (51) become equal, and after adding the two bounds, we arrive at

(52)

Under the conditions of Theorem 1, the bound in (52) simplifies to

(53)

Finally, upon fixing the constants as in the statement of the theorem, whenever the number of measurements exceeds the prescribed value, (53) yields the claimed probability guarantee, which completes the proof.

ACKNOWLEDGMENT

The authors would like to thank Dr. D. Angelosante and Prof. X. Cai for valuable feedback on the contributions of this paper.

REFERENCES

[1] D. Angelosante, J. A. Bazerque, and G. B. Giannakis, “Online adaptive estimation of sparse signals: Where RLS meets the $\ell_1$-norm,” IEEE Trans. Signal Process., vol. 58, no. 7, pp. 3436–3447, Jul. 2010.

[2] S. Benedetto and E. Biglieri, “Nonlinear equalization of digital satellite channels,” IEEE J. Sel. Areas Commun., vol. SAC-1, no. 1, pp. 57–62, Jan. 1983.

[3] T. W. Berger, D. Song, R. H. M. Chan, and V. Z. Marmarelis, “The neurobiological basis of cognition: Identification by multi-input, multi-output nonlinear dynamic modeling,” Proc. IEEE, vol. 98, no. 3, pp. 356–374, Mar. 2010.

[4] P. J. Bickel, Y. Ritov, and A. B. Tsybakov, “Simultaneous analysis of Lasso and Dantzig selector,” Ann. Statist., vol. 37, no. 4, pp. 1705–1732, 2009.


[5] E. J. Candès, “The restricted isometry property and its implications for compressed sensing,” Comptes Rendus de l’Academie des Sciences, Paris, Serie I, vol. 346, pp. 589–592, 2008.

[6] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.

[7] E. J. Candès and T. Tao, “The Dantzig selector: Statistical estimation when $p$ is much larger than $n$,” Ann. Statist., vol. 35, no. 6, pp. 2313–2351, Dec. 2007.

[8] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, pp. 33–61, 1999.

[9] H. J. Cordell, “Detecting gene-gene interactions that underlie human diseases,” Nature Rev. Genetics, vol. 10, pp. 392–404, Jun. 2009.

[10] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” J. Amer. Statist. Assoc., vol. 96, no. 456, pp. 1348–1360, Dec. 2001.

[11] M. Franz and B. Scholkopf, “A unifying view of Wiener and Volterra theory and polynomial kernel regression,” Neural Comput., vol. 18, pp. 3097–3118, 2006.

[12] J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani, “Pathwise coordinate optimization,” Ann. Appl. Statist., vol. 1, pp. 302–332, Dec. 2007.

[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Springer Series in Statistics. New York: Springer, 2009.

[14] J. Haupt, W. U. Bajwa, G. Raz, and R. Nowak, “Toeplitz compressed sensing matrices with applications to sparse channel estimation,” IEEE Trans. Inf. Theory, vol. 56, no. 11, pp. 5862–5875, Nov. 2010.

[15] V. Kekatos, D. Angelosante, and G. B. Giannakis, “Sparsity-aware estimation of nonlinear Volterra kernels,” presented at the CAMSAP, Aruba, Dutch Antilles, Dec. 2009.

[16] V. Mathews and G. Sicuranza, Polynomial Signal Processing. New York: Wiley, 2000.

[17] G. Mileounis, B. Babadi, N. Kalouptsidis, and V. Tarokh, “An adaptive greedy algorithm with application to nonlinear communications,” IEEE Trans. Signal Process., vol. 58, no. 6, pp. 2998–3007, Jun. 2010.

[18] B. Nazer and R. Nowak, “Sparse interactions: Identifying high-dimensional multilinear systems via compressed sensing,” presented at the Allerton Conf., Monticello, IL, 2010.

[19] R. Nowak and B. V. Veen, “Invertibility of higher order moment matrices,” IEEE Trans. Signal Process., vol. 43, no. 3, pp. 705–708, Mar. 1995.

[20] G. Palm and T. Poggio, “The Volterra representation and the Wiener expansion: Validity and pitfalls,” SIAM J. Appl. Math., vol. 33, no. 2, pp. 195–216, Sep. 1977.

[21] S. Pemmaraju, “Equitable coloring extends Chernoff-Hoeffding bounds,” in Proc. RANDOM-APPROX 2001, Berkeley, CA, Aug. 2001, pp. 285–296.

[22] H. Rauhut, “Compressive sensing and structured random matrices,” in Theoretical Foundations and Numerical Methods for Sparse Recovery, ser. Radon Series Comp. Appl. Math., M. Fornasier, Ed. New York: De Gruyter, 2010, vol. 9, pp. 1–92.

[23] B. Scholkopf and A. J. Smola, Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.

[24] D. Song, H. Wang, and T. W. Berger, “Estimating sparse Volterra models using group $\ell_1$-regularization,” in Proc. IEEE Int. Conf. Engineering in Medicine and Biology Society (EMBC), Buenos Aires, Argentina, Sep. 2010, pp. 4128–4131.

[25] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy. Statist. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.

[26] N. A. Tinker, D. E. Mather, B. G. Rossnagel, K. J. Kasha, A. Kleinhofs, P. M. Hayes, and D. E. Falk, “Regions of the genome that affect agronomic performance in two-row barley,” Crop Sci., vol. 36, pp. 1053–1062, 1996.

[27] T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange, “Genome-wide association analysis by Lasso penalized logistic regression,” Bioinformatics, vol. 25, no. 6, pp. 714–721, 2009.

[28] S. Xu, “An empirical Bayes method for estimating epistatic effects of quantitative trait loci,” Biometrics, vol. 63, no. 2, pp. 513–521, 2007.

[29] S. Xu and Z. Jia, “Genomewide analysis of epistatic effects for quantitative traits in barley,” Genetics, vol. 175, pp. 1955–1963, 2007.

[30] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Roy. Statist. Soc. B, vol. 68, no. 1, pp. 49–67, 2006.

[31] M. Zeller and W. Kellermann, “Fast and robust adaptation of DFT-domain Volterra filters in diagonal coordinates using iterated coefficient updates,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1589–1604, Mar. 2010.

[32] H. Zou, “The adaptive Lasso and its oracle properties,” J. Amer. Statist. Assoc., vol. 101, no. 476, pp. 1418–1429, Dec. 2006.

Vassilis Kekatos (M’10) was born in Athens, Greece, in 1978. He received the Diploma, M.Sc., and Ph.D. degrees in computer engineering and computer science from the University of Patras, Greece, in 2001, 2003, and 2007, respectively.

Since 2009, he has been a Marie Curie Fellow, and he is currently a Postdoctoral Associate with the Department of Electrical and Computer Engineering of the University of Minnesota, Minneapolis, and the Department of Computer Engineering and Informatics, University of Patras, Greece.

Georgios B. Giannakis (F’97) received the Diploma degree in electrical engineering from the National Technical University of Athens, Greece, in 1981, and the M.Sc. degree in electrical engineering, the M.Sc. degree in mathematics, and the Ph.D. degree in electrical engineering from the University of Southern California (USC) in 1983, 1986, and 1986, respectively.

Since 1999, he has been a Professor with the University of Minnesota, Minneapolis, where he currently holds an ADC Chair in Wireless Telecommunications in the Electrical and Computer Engineering Department and serves as Director of the Digital Technology Center. His general interests span the areas of communications, networking, and statistical signal processing, subjects on which he has published more than 300 journal papers, 500 conference papers, 20 book chapters, two edited books, and two research monographs. Current research focuses on compressive sensing, cognitive radios, network coding, cross-layer designs, wireless sensors, and social and power grid networks.

Dr. Giannakis is the (co-)inventor of 20 issued patents, and the (co-)recipient of seven paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000) and from EURASIP (2005), a Young Faculty Teaching Award, and the G. W. Taylor Award for Distinguished Research from the University of Minnesota. He is a Fellow of EURASIP and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE SP Society.

