
Sparse Modal Additive Model

Hong Chen, Yingjie Wang, Feng Zheng, Member, IEEE, Cheng Deng, Senior Member, IEEE, and Heng Huang

Abstract— Sparse additive models have been successfully applied to high-dimensional data analysis due to the flexibility and interpretability of their representation. However, the existing methods are often formulated using the least-squares loss to learn the conditional mean, which is sensitive to data with non-Gaussian noises, e.g., skewed noise, heavy-tailed noise, and outliers. To tackle this problem, we propose a new robust regression method, called the sparse modal additive model (SpMAM), by integrating the modal regression metric, the data-dependent hypothesis space, and the weighted $\ell_{q,1}$-norm regularizer ($q \ge 1$) into the additive models. Specifically, the modal regression metric assures the model robustness to complex noises via learning the conditional mode, the data-dependent hypothesis space offers model adaptivity via a sample-based representation, and the $\ell_{q,1}$-norm regularizer addresses algorithmic interpretability via sparse variable selection. In theory, the proposed SpMAM enjoys statistical guarantees on asymptotic consistency for regression estimation and variable selection simultaneously. Experimental results on both synthetic and real-world benchmark data sets validate the effectiveness and robustness of the proposed model.

Index Terms— Additive models, data-dependent hypothesis space, generalization bound, modal regression, variable selection.

I. INTRODUCTION

ADDITIVE models [1], [2], rooted in nonparametric estimation, are powerful tools for prediction (e.g., regression and classification) and inference (e.g., variable selection) tasks in high-dimensional data analysis. Sparseness arose in additive models with the growing importance of model interpretability, and several sparse additive learning algorithms, such as SpAM [3], SAM [4], and GroupSAM [5], have been successfully introduced to address different applications. These methods are attractive due to their flexible representation, interpretable results, and theoretical guarantees, e.g., asymptotic consistency on estimation and variable selection.

Manuscript received January 8, 2019; revised October 30, 2019 and February 29, 2020; accepted March 5, 2020. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 11671161 and Grant 61972188. (Corresponding authors: Feng Zheng; Cheng Deng.)

Hong Chen is with the College of Science, Huazhong Agricultural University, Wuhan 430070, China, and also with the Hubei Engineering Technology Research Center of Agricultural Big Data, Huazhong Agricultural University, Wuhan 430070, China (e-mail: [email protected]).

Yingjie Wang is with the College of Informatics, Huazhong Agricultural University, Wuhan 430070, China (e-mail: [email protected]).

Feng Zheng is with the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China, and also with the Research Institute of Trustworthy Autonomous Systems, Shenzhen 518055, China (e-mail: [email protected]).

Cheng Deng is with the School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]).

Heng Huang is with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15260 USA, and also with JD Finance America Corporation, Mountain View, CA 94043 USA (e-mail: [email protected]).

This article has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2020.3005144

In the machine learning literature, many additive models have been formulated under the Tikhonov regularization scheme, where the hypothesis function space, the empirical risk, and the regularization penalty are the key building blocks of the algorithmic design. The hypothesis space in additive models is required to have an additive structure, which is useful to address "the curse of dimensionality" [1], [6], [7] and provides an interpretable representation. Usually, the additive structure is obtained by decomposing the input space $\mathcal{X} \subset \mathbb{R}^p$ into component spaces $\{\mathcal{X}_j\}_{j=1}^p$ and defining the hypothesis function as the sum of component functions $f_j: \mathcal{X}_j \to \mathbb{R}$. In real-world applications, each component function $f_j$ lies in a reproducing kernel Hilbert space (RKHS) [5], [6], [8], [9] or in the space spanned by an orthogonal basis [3], [10]–[12]. The empirical risk in additive models follows the line of convex risk minimization, where the most popular error metrics are the least-squares loss for regression [1], [3], [12], [13] and the hinge loss for classification [4], [5]. Essentially, the empirical risk minimization strategy used with these error metrics aims to approximate the mean function, which is the optimal criterion for training data satisfying the Gaussian noise assumption. The regularization penalties in additive models usually include a sparsity-inducing term (e.g., the $\ell_1$-norm regularizer and the $\ell_{2,1}$-norm regularizer [3], [5], [11], [12]) and a smoothness-inducing term (e.g., the kernel-norm regularizer [9], [14]).

Although the aforementioned studies have offered various additive models with promising behaviors, the previous works are typically concerned with finding a real-valued function $f: \mathcal{X} \to \mathbb{R}$ that approximates the conditional mean as closely as possible. It is well known that learning models aimed at estimating the conditional mean suffer degraded performance when facing data with non-Gaussian noise, e.g., heavy-tailed noise and skewed noise. This motivates us to target a different statistical quantity, the conditional mode, in place of the conditional mean. To the best of our knowledge, this is the first work that considers mode-based additive models.

To estimate the conditional mode function, various modal regression algorithms have been constructed by local approaches [18]–[20] and global approaches [16], [21]. Usually, these existing methods are formulated under a global mode assumption [22], [23] or a local mode condition [19], [20].


TABLE I

PROPERTIES OF DIFFERENT REGRESSION ALGORITHMS

Global modal regression is usually constructed with a single-valued function, which tries to find the most likely value to describe the covariate-response relationship. Local modal regression, on the contrary, uses a multivalued function to mine more complicated relations when the covariate-response relationship has several distinct components. Besides the rich statistical theory, some application-oriented studies on modal regression have shown excellent performance for regression prediction (e.g., nonparametric forecasting [24], traffic engineering [25], and cognitive impairment prediction [15]) and cluster estimation [26]. Recently, learning theory analysis has been established for kernel modal regression (KMR) [17] and regularized linear modal regression [15]. Both methods in [15] and [17] achieve learning rates with polynomial decay and show superior performance over the related least-squares methods under the mean squared error (MSE) criterion.

Following the research line of [5] and [15], we propose a novel sparse modal additive model (SpMAM) under the Tikhonov regularization framework, which brings together the modal regression metric associated with kernel density estimation (KDE), the data-dependent hypothesis space, and the sparse $\ell_{q,1}$-regularizer ($q \ge 1$) in a natural way to conduct estimation and variable selection simultaneously. Here, the modal regression metric assures the model robustness to complex noises via learning the conditional mode, the data-dependent hypothesis space offers model adaptivity via a sample-based representation, and the sparseness-inducing $\ell_{q,1}$-regularizer provides algorithmic interpretability via efficient variable selection. Indeed, the modal regression metric associated with the Gaussian KDE is closely related to the error metric under the maximum correntropy criterion (MCC) [27]–[29]. As an additive-structure extension of regularized modal regression (RMR) [15], the proposed model enjoys further advantages brought by combining the data-dependent hypothesis space and the $\ell_{q,1}$-norm penalty. The data-dependent hypothesis space has been well investigated from the viewpoint of learning theory in [30]–[32] and offers much adaptivity and flexibility for nonlinear learning algorithms. The $\ell_{q,1}$-norm penalty ($q \ge 1$) has been used extensively and successfully in sparse learning models, e.g., the $\ell_1$-norm regularizer for Lasso [33], SpAM [3], and SAM [4], and the $\ell_{2,1}$-regularizer for GroupSpAM [12] and GroupSAM [5]. As far as we know, these building blocks (the mode-induced metric, the data-dependent hypothesis space, and the $\ell_{q,1}$-norm regularizer) have not been unified in this fashion before.

To better highlight the novelty of SpMAM, some algorithmic properties are summarized in Table I for our model and other related methods, e.g., RMR [15], modal linear regression (MODLR) [16], KMR [17], the COmponent Selection and Smoothing Operator (COSSO) [13], and sparse additive models (SpAMs) [3].

One of the main features of the proposed SpMAM is that it has theoretical guarantees on asymptotic consistency for regression estimation and variable selection without the Gaussian noise assumption. This is a major advantage over most existing SpAMs, which are sensitive to skewed noise, heavy-tailed noise, and outliers. Based on a new error decomposition and a constructive analysis of the hypothesis error, we prove that SpMAM can attain a learning rate of order $O(n^{-2/5})$ under mild conditions, which is faster than the $O(n^{-1/7})$ rate in [15] for RMR. In terms of the solution characteristics of SpMAM, we demonstrate that the proposed model can identify the truly informative variables in theory.

Another interesting feature of the proposed model is that it can be implemented easily by integrating the half-quadratic (HQ) optimization [34] and the alternating direction method of multipliers (ADMM) [35]. The HQ technique is employed to resolve the computational difficulty induced by the modal regression metric, while the ADMM strategy is used to tackle the corresponding regularization scheme.

Finally, the empirical effectiveness of the proposed model is supported by experimental evaluations. Empirical results show that SpMAM can identify the truly informative variables and estimate the intrinsic function efficiently even for data with complex noise, e.g., chi-square noise and Student-t noise.

The rest of this article is organized as follows. Section II recalls the background for modal regression and formulates SpMAM. Section III establishes the theoretical results of the proposed model on the generalization bound and variable selection consistency. After describing the computing procedure of SpMAM in Section IV, we provide numerical results in Section V. Finally, Section VI concludes this article.

II. SPARSE MODAL ADDITIVE MODEL

A. Problem Formulation

Let $\mathcal{X} \subset \mathbb{R}^p$ be a compact input space and $\mathcal{Y} \subset \mathbb{R}$ be an output set. We consider the following nonparametric model:
$$Y = f^*(X) + \epsilon \tag{1}$$
where $X \in \mathcal{X}$, $Y \in \mathcal{Y}$, $\epsilon$ is random noise, and $f^*: \mathcal{X} \to \mathcal{Y}$ is the true regression function. For simplicity, we denote by $\rho$ the joint distribution of $(X, Y)$ generated by (1) and by $\rho_X$ the corresponding marginal distribution with respect to $X$.


When the random noise satisfies $E(\epsilon|X) = 0$ (e.g., zero-mean Gaussian noise), the intrinsic regression function can be rewritten as the conditional mean of $Y$ given $X = x \in \mathcal{X}$, that is,
$$f^*(x) = E(Y|X = x) = \int_{\mathcal{Y}} y\, d\rho(y|X = x), \quad x \in \mathcal{X}. \tag{2}$$
It is well known that the regression function in (2) is also the minimizer of the expected risk
$$\mathcal{E}(f) = E(Y - f(X))^2 = \int_{\mathcal{X}\times\mathcal{Y}}(y - f(x))^2\, d\rho(x, y) \tag{3}$$
based on the least-squares loss [36]. Since the intrinsic distribution $\rho$ is unknown, we cannot obtain the regression function directly by minimizing the expected risk (3). In the statistical learning setting, we only have partial information about $\rho$ via the empirical observations $z := \{(x_i, y_i)\}_{i=1}^n \subset \mathcal{X}\times\mathcal{Y}$, which are drawn independently from the unknown distribution $\rho$ on $\mathcal{X}\times\mathcal{Y}$. Therefore, the empirical risk
$$\mathcal{E}_z(f) = \frac{1}{n}\sum_{i=1}^n(y_i - f(x_i))^2$$
is used to measure the learning ability of an estimator $f: \mathcal{X}\to\mathbb{R}$, and many learning algorithms are designed by minimizing $\mathcal{E}_z(f)$ over some hypothesis space, e.g., a linear function space or an RKHS. Moreover, by imposing certain restrictions on the hypothesis spaces, most existing learning methods have been formulated under the structural risk minimization (SRM) principle, e.g., Lasso [33], kernel ridge regression (KRR) [36], [37], and SpAM [3].

However, when the data are contaminated by non-Gaussian noises, the intrinsic target function $f^*$ is no longer equivalent to the conditional mean. In such a situation, the least-squares loss under the MSE criterion may not be an appropriate error metric, and learning algorithms associated with $\mathcal{E}_z(f)$ may suffer degraded performance. To alleviate the algorithmic sensitivity to complex noises, it is crucial to develop new learning models under a robust metric criterion.

Instead of requiring $E(\epsilon|X) = 0$ as in least-squares methods, the modal regression in [15]–[17] assumes that the conditional mode of the random noise $\epsilon$ equals zero for any $x \in \mathcal{X}$, that is,
$$\mathrm{mode}(\epsilon|X = x) := \arg\max_{t\in\mathbb{R}} p_{\epsilon|X}(t|X = x) = 0 \tag{4}$$
where $p_{\epsilon|X}$ is the conditional density of $\epsilon$ given $X$. It should be remarked that the abovementioned zero-mode noise condition assumes neither homogeneity nor symmetry of the noise distribution, and it covers skewed noise, heavy-tailed noise, and outliers [17], [19].

By taking the mode on both sides of (1), we get the modal regression function
$$f^*(x) = \mathrm{mode}(Y|X = x) = \arg\max_{t} p_{Y|X}(t|X = x), \quad x \in \mathcal{X} \tag{5}$$
where $p_{Y|X}$ is the conditional density of $Y$ given $X$. Throughout this article, we assume that the global mode of the conditional density $p_{Y|X}$ exists and is unique. It has been proven that $f^*$ in (5) can be obtained by maximizing the joint density $p_{X,Y}$ [16], [19], [20]. In particular, under the Gaussian noise setting, the conditional mode in (5) equals the conditional mean in (2).
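
To make the distinction between (2) and (5) concrete, the short numpy sketch below (our own illustration, not taken from the paper; the noise law, bandwidth, and sample size are assumptions) draws samples of $Y$ at a fixed $x$ under zero-mode chi-square noise and compares the empirical conditional mean with a KDE-based conditional mode estimate: the mode recovers $f^*(x)$ while the mean is biased.

import numpy as np

rng = np.random.default_rng(0)
f_star_x = 1.0                                  # value of f*(x) at a fixed x (placeholder)
eps = rng.chisquare(df=3, size=10000) - 1.0     # chi-square(3) shifted so its mode is 0 (its mean is then 2)
y = f_star_x + eps                              # draws from Y | X = x under model (1)

cond_mean = y.mean()                            # conditional-mean estimate, biased by the skewed noise

# Conditional-mode estimate: argmax of a Gaussian KDE evaluated on a grid.
sigma = 0.2                                     # KDE bandwidth (assumed)
grid = np.linspace(y.min(), y.max(), 400)
dens = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / sigma) ** 2).sum(axis=1) / (len(y) * sigma * np.sqrt(2 * np.pi))
cond_mode = grid[np.argmax(dens)]

print(f"conditional mean ~ {cond_mean:.2f}, conditional mode ~ {cond_mode:.2f}, f*(x) = {f_star_x}")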

B. Modal Regression Metric

Based on the modal regression function in (5), it is natural to measure the learning performance of $f: \mathcal{X}\to\mathbb{R}$ by the following modal regression metric [15], [17]:
$$\mathcal{R}(f) = \int_{\mathcal{X}} p_{Y|X}\big(f(x)|X = x\big)\, d\rho_X(x). \tag{6}$$
It can be verified that the optimal estimator $f^*$ in (5) is the maximizer of $\mathcal{R}(f)$ over all measurable functions (see also [17, Th. 3]). However, it is impossible to obtain the estimator directly by maximizing the criterion $\mathcal{R}(f)$ because both $p_{Y|X}$ and $\rho_X$ are unknown.

After introducing the random variable $E_f = Y - f(X)$, [17, Th. 5.1] states that
$$\mathcal{R}(f) = p_{E_f}(0)$$
where $p_{E_f}(0)$ is the density function of $E_f$ evaluated at $0$. Hence, we can transform the problem of maximizing $\mathcal{R}(f)$ over some hypothesis space into the task of maximizing the density $p_{E_f}$ at $0$. This density $p_{E_f}(0)$ can be estimated by KDE associated with a kernel $K_\sigma: \mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$.

For feasibility of notation, we further denote the kernel-based representing function $\phi((u - u')/\sigma) = K_\sigma(u, u')$ for all $u, u' \in \mathbb{R}$, which usually satisfies, for any $u \in \mathbb{R}$, $\phi(u) = \phi(-u)$, $\phi(u) > 0$, and $\int_{\mathbb{R}}\phi(u)\,du = 1$ [15], [17]. It is easy to verify that these conditions on the representing function are satisfied by the Gaussian kernel, the logistic kernel, and the sigmoid kernel. Indeed, the choice of kernel function does not play an important role in KDE [38]; therefore, in practice, a proper kernel is usually determined by the lower prediction error.

Given empirical observations $z := \{(x_i, y_i)\}_{i=1}^n \subset \mathcal{X}\times\mathcal{Y}$ and any $f: \mathcal{X}\to\mathbb{R}$, we can approximate $p_{E_f}(0)$ by the kernel density estimator $\hat{p}_{E_f}(0) := \mathcal{R}^\sigma_z(f)$ defined as follows:
$$\mathcal{R}^\sigma_z(f) = \frac{1}{n\sigma}\sum_{i=1}^n K_\sigma\big(y_i - f(x_i), 0\big) = \frac{1}{n\sigma}\sum_{i=1}^n \phi\Big(\frac{y_i - f(x_i)}{\sigma}\Big). \tag{7}$$
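
A minimal sketch of evaluating the empirical modal regression metric (7) for a candidate predictor (our own code, not the authors'; the Gaussian representing function is normalized so that $\int\phi = 1$, and the toy data are assumptions):

import numpy as np

def empirical_modal_metric(y, f_x, sigma, phi=None):
    """R^sigma_z(f) = (1/(n*sigma)) * sum_i phi((y_i - f(x_i)) / sigma), cf. (7)."""
    if phi is None:
        phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian representing function
    residuals = (np.asarray(y) - np.asarray(f_x)) / sigma
    return phi(residuals).sum() / (len(y) * sigma)

# Usage with toy data: a better fit yields a larger metric value, since (7) is maximized.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)
print(empirical_modal_metric(y, np.sin(np.pi * x), sigma=0.5))   # good predictor
print(empirical_modal_metric(y, np.zeros_like(x), sigma=0.5))    # poor predictor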

By taking the expectation of $\mathcal{R}^\sigma_z(\cdot)$ with respect to $z$, we further obtain the data-free modal regression metric
$$\mathcal{R}^\sigma(f) = \frac{1}{\sigma}\int_{\mathcal{X}\times\mathcal{Y}} \phi\Big(\frac{y - f(x)}{\sigma}\Big)\, d\rho(x, y). \tag{8}$$
In particular, [17, Th. 10] assures that $\mathcal{R}(f) - \mathcal{R}^\sigma(f) \to 0$ as $\sigma \to 0$. In theory, for a predictor $f: \mathcal{X}\to\mathbb{R}$, we can measure its excess generalization error $\mathcal{R}(f^*) - \mathcal{R}(f)$ in terms of the surrogate metric $\mathcal{R}^\sigma(f^*) - \mathcal{R}^\sigma(f)$ [15].

Remark 1: Following [39] and [40], mode estimators can be conceptually classified into indirect and direct approaches. Indirect estimators are usually constructed based on a suitable density estimator, whereas direct modal estimators do not depend on estimating the density as a preliminary step. Direct mode estimators have been investigated in [41]–[44]. In particular, [44] stated that the mode functional is not elicitable, which implies that there is no scoring or loss function under which the mode is the Bayes predictor. Clearly, the KDE-based mode estimator can be obtained by maximizing the modal regression metric (7) over certain hypothesis spaces, so it belongs to the indirect approach.

Fig. 1. (a) Plots of the least-squares loss, the least absolute loss, and the distance-based metric associated with the Gaussian kernel (ModeGau metric with $\sigma = 0.5$). (b)–(d) Plots of the distance-based metrics with the Gaussian kernel (ModeGau metric), sigmoid kernel (ModeSig metric), and logistic kernel (ModeLog metric), respectively. Different parameters are considered, including $\sigma = 0.5, 0.75, 1, 1.25$.

Remark 2: Following the ideas of [28] and [45], the distance-based metric $L_\sigma: \mathbb{R}\to[0,\infty)$ is defined as
$$L_\sigma\big(y - f(x)\big) = \frac{1}{\sigma}\Big(\phi(0) - \phi\Big(\frac{y - f(x)}{\sigma}\Big)\Big). \tag{9}$$
It follows that $L_\sigma(0) = 0$ and that its derivative satisfies $\lim_{t\to\infty} L'_\sigma(t) = 0$. This indicates that the introduced metric is essentially a redescending M-estimator [46] when $\phi$ is differentiable. In addition, the plots of the metrics (the distance-based metrics (9) with different density estimation kernels, the least absolute loss, and the least-squares loss) are illustrated in Fig. 1, and their properties are further summarized in Table II.
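
A small numerical sketch of (9) for three representing functions follows (our own illustration; the Gaussian and sigmoid expressions are those listed in Section V, and the logistic kernel is the standard KDE form, so the exact constants are assumptions). Unlike the squared loss, $L_\sigma$ saturates for large residuals, which is the redescending behavior noted above.

import numpy as np

def phi_gaussian(u):
    return np.exp(-0.5 * u ** 2)

def phi_sigmoid(u):
    return 2.0 / (np.pi * (np.exp(u) + np.exp(-u)))

def phi_logistic(u):
    return np.exp(-u) / (1.0 + np.exp(-u)) ** 2

def mode_metric(t, sigma, phi):
    """L_sigma(t) = (phi(0) - phi(t/sigma)) / sigma, cf. (9)."""
    return (phi(0.0) - phi(t / sigma)) / sigma

t = np.linspace(-6.0, 6.0, 7)
for name, phi in [("Gaussian", phi_gaussian), ("sigmoid", phi_sigmoid), ("logistic", phi_logistic)]:
    # Values flatten out for large |t|: outliers contribute a bounded amount to the risk.
    print(name, np.round(mode_metric(t, sigma=0.5, phi=phi), 3))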

From the viewpoint of the maximum likelihood method, the least-squares loss and the least absolute loss are the optimal error metrics for data with Gaussian noise and Laplacian noise, respectively. In contrast, the introduced metric is suitable for data with complex noise, e.g., skewed noise, heavy-tailed noise, and outliers, due to its special emphasis on the conditional mode.

C. Sparse Modal Additive Model

TABLE II
DEFINITIONS AND PROPERTIES OF DIFFERENT MEASURES

As a natural extension of the linear model, additive models provide much representation flexibility for regression estimation and variable selection by employing nonlinear hypothesis function spaces with an additive structure. The most typical way to assure the additivity of the predictive functions is to divide the input space $\mathcal{X} \subset \mathbb{R}^p$ into $p$ parts $\{\mathcal{X}_j\}_{j=1}^p$ directly, where $X = (X_1, \ldots, X_p)^T$. Indeed, the decomposition of the input space is an efficient way to circumvent "the curse of dimensionality" in nonparametric regression [1], [6], [7], which is one of the motivations for imposing the additive structure on hypothesis spaces. The hypothesis space with the additive structure is formulated as
$$\mathcal{F} = \Big\{ f : f(u) = \sum_{j=1}^p f_j(u_j),\ f_j \in \mathcal{F}_j,\ u = (u_1, \ldots, u_p)^T \in \mathbb{R}^p \Big\}$$
where each $u_j \in \mathcal{X}_j$ and $\mathcal{F}_j$ is the univariate component function space on $\mathcal{X}_j$. Typical examples of $\mathcal{F}_j$ include the spline-based function space [3], [4], [11] and the RKHS [5], [8], [9], [14]. Indeed, the spline-based approaches require specifying the number of basis functions and the sequence of knots, while the kernel methods can be implemented with few tuning parameters [37], [47].

In this article, we choose $\mathcal{H}_{K_j}$, $j = 1, \ldots, p$, to form the additive hypothesis space, where each $\mathcal{H}_{K_j}$ is an RKHS associated with a Mercer kernel $K_j: \mathcal{X}_j\times\mathcal{X}_j\to\mathbb{R}$ and a kernel norm $\|\cdot\|_{K_j}$ [36]. Recall that a Mercer kernel is symmetric and positive semidefinite. It should be noticed that
$$\mathcal{H}_K = \Big\{ \sum_{j=1}^p f_j : f_j \in \mathcal{H}_{K_j},\ 1 \le j \le p \Big\}$$
with $\|f\|^2_K = \inf\big\{\sum_{j=1}^p\|f_j\|^2_{K_j} : f = \sum_{j=1}^p f_j\big\}$ is also an RKHS with respect to the additive kernel $K = \sum_{j=1}^p K_j$ [9], [14]. For given training samples $z = \{(x_i, y_i)\}_{i=1}^n$, the modal regression model in $\mathcal{H}_K$ can be formulated as
$$f_{z,\eta} = \arg\max_{f\in\mathcal{H}_K}\Big\{\mathcal{R}^\sigma_z(f) - \eta\sum_{j=1}^p\tau_j\|f_j\|^2_{K_j}\Big\} \tag{10}$$
where $\eta > 0$ is a regularization parameter and $\tau_j > 0$ is the weight for the $j$th kernel norm. The representer theorem [48], [49] guarantees that $f_{z,\eta}$ in (10) can be represented as
$$f_{z,\eta} = \sum_{j=1}^p\sum_{i=1}^n \alpha^{z,\eta}_{j,i} K_j(x_{ij}, \cdot), \quad \alpha^{z,\eta}_{j,i} \in \mathbb{R} \tag{11}$$


where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T \in \mathbb{R}^p$ for any $i \in \{1, 2, \ldots, n\}$. Observe that the $j$th component function of $f_{z,\eta}$ equals zero if $\alpha^{z,\eta}_{j,i} = 0$ for all $i \in \{1, 2, \ldots, n\}$. Hence, it is natural to expect $\|\alpha^{z,\eta}_j\|_q = (\sum_{i=1}^n|\alpha^{z,\eta}_{j,i}|^q)^{1/q} = 0$ for $\alpha^{z,\eta}_j = (\alpha^{z,\eta}_{j,1}, \ldots, \alpha^{z,\eta}_{j,n})^T \in \mathbb{R}^n$ when the $j$th input variable is not truly informative. To enhance the model sparsity and interpretability, we consider the sparsity-induced penalty
$$\Omega_q(f) = \inf\Big\{\sum_{j=1}^p\tau_j\|\alpha_j\|_q : f = \sum_{j=1}^p\sum_{i=1}^n\alpha_{ji}K_j(x_{ij}, \cdot)\Big\}, \quad q \ge 1 \tag{12}$$
in the data-dependent hypothesis space
$$\mathcal{H}_{K,z} = \Big\{\sum_{j=1}^p\sum_{i=1}^n\alpha_{ji}K_j(x_{ij}, \cdot) : \alpha_{ji} \in \mathbb{R}\Big\}. \tag{13}$$
Then, the SpMAM can be formulated as
$$f_z := f_{z,\lambda} = \arg\max_{f\in\mathcal{H}_{K,z}}\big\{\mathcal{R}^\sigma_z(f) - \lambda\Omega_q(f)\big\} \tag{14}$$
where $\lambda > 0$ is a tradeoff parameter and $\mathcal{R}^\sigma_z(f)$ is the empirical metric defined in (7).

Denote $K_{ji} = (K_j(x_{1j}, x_{ij}), \ldots, K_j(x_{nj}, x_{ij}))^T \in \mathbb{R}^n$ and $\alpha_j = (\alpha_{j1}, \ldots, \alpha_{jn})^T \in \mathbb{R}^n$. The SpMAM (14) can be rewritten as
$$f_z = \sum_{j=1}^p f_{z,j} = \sum_{j=1}^p\sum_{i=1}^n\alpha^z_{ji}K_j(x_{ij}, \cdot)$$
with
$$\{\alpha^z_j\} = \arg\max_{\alpha_j, 1\le j\le p}\Big\{\frac{1}{n\sigma}\sum_{i=1}^n\phi\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha_j}{\sigma}\Big) - \lambda\sum_{j=1}^p\tau_j\|\alpha_j\|_q\Big\}. \tag{15}$$
Indeed, the learning schemes in (14) and (15) are formulated from the viewpoints of function approximation and coefficient estimation, respectively.
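
For concreteness, the following sketch (our own illustration, not the authors' code) evaluates the coefficient-based objective in (15) on hypothetical data, using a Gaussian component kernel of bandwidth $h$ as in Section V and the unnormalized Gaussian representing function of Remark 3; all sizes and parameter values are assumptions.

import numpy as np

def component_kernels(X, h):
    """Per-variable Gaussian kernel matrices K_j with (K_j)_{s,i} = exp(-(x_{sj}-x_{ij})^2/(2h^2))."""
    n, p = X.shape
    K = np.empty((p, n, n))
    for j in range(p):
        d = X[:, j][:, None] - X[:, j][None, :]
        K[j] = np.exp(-d ** 2 / (2.0 * h ** 2))
    return K

def spmam_objective(alpha, K, y, sigma, lam, tau, q=2):
    """Objective of (15) with phi(u) = exp(-u^2/2):
    (1/(n*sigma)) sum_i phi((y_i - sum_j K_ji^T alpha_j)/sigma) - lam * sum_j tau_j ||alpha_j||_q."""
    n = len(y)
    f_x = np.einsum('jsi,js->i', K, alpha)          # f(x_i) = sum_j sum_s alpha_{j,s} K_j(x_{sj}, x_{ij})
    fit = np.exp(-0.5 * ((y - f_x) / sigma) ** 2).sum() / (n * sigma)
    penalty = lam * np.sum(tau * np.linalg.norm(alpha, ord=q, axis=1))
    return fit - penalty

# Toy usage (assumed data-generating function and parameters).
rng = np.random.default_rng(2)
n, p = 50, 5
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)
K = component_kernels(X, h=0.5)
alpha = np.zeros((p, n))                             # coefficient vector alpha_j per variable, stacked row-wise
print(spmam_objective(alpha, K, y, sigma=0.5, lam=0.01, tau=np.ones(p), q=2))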

Remark 3: When $q = 2$, our model (14) is closely related to GroupSAM in [5], which considers the hinge loss for binary classification in a data-dependent hypothesis space. Indeed, both SpAM in [3] and GroupSAM in [5] aim to learn the conditional mean. When $q = 1$, the proposed model is also related to coefficient-based kernel regression with the $\ell_1$-regularizer [30], [32], [50], but the latter searches for a linear combination of basis functions and does not consider variable selection. In particular, for $K_\sigma(t, 0) = \phi(t/\sigma) = \exp(-t^2/(2\sigma^2))$ and $\tau_j \equiv 1$ for any $j \in \{1, \ldots, p\}$, (15) can be rewritten as
$$\{\alpha^z_j\} = \arg\max_{\alpha_j}\Big\{\frac{1}{n\sigma}\sum_{i=1}^n\exp\Big(-\frac{\big(y_i - \sum_{j=1}^p K_{ji}^T\alpha_j\big)^2}{2\sigma^2}\Big) - \lambda\sum_{j=1}^p\|\alpha_j\|_q\Big\}.$$
For given $\sigma$ and $q = 1$, the abovementioned optimization extends the sparse correntropy regression in [45] to additive models, which makes it possible to circumvent "the curse of dimensionality" in nonparametric regression [6], [7] and to identify the truly informative variables [15].

Remark 4: The proposed framework can be extended to the group sparse setting directly by replacing $\{\mathcal{X}_j\}_{j=1}^p$ with subgroups $\{\mathcal{X}^{(j)}\}$, where each $\mathcal{X}^{(j)}$ denotes a component input space accounting for interactions among variables. Typical ways to obtain the grouped component spaces include the functional ANOVA model for exploring $d$-order interactions ($1 \le d \le p$) [6], [13] and input decomposition from prior knowledge [5], [12], [51].

Remark 5: The optimization problem (15) can be solved by integrating the HQ optimization [34], [52] and the ADMM [35]. Moreover, the active variable set is defined as
$$\hat{J}_z = \big\{j : \|\alpha^z_j\|_q > v_n\big\} \tag{16}$$
where $v_n$ is a positive threshold obtained via the stability-based selection strategy [53]. We provide the detailed optimization algorithm in Section IV.

III. LEARNING THEORY ANALYSIS

A. Main Theoretical Results

This section focuses on bounding the excess generalization error $\mathcal{R}(f^*) - \mathcal{R}(f_z)$ and investigating the variable selection ability of SpMAM. We first recall some necessary assumptions used or discussed in [15] and [17].

Assumption 1: The representing function $\phi$, associated with the modal kernel $K_\sigma: \mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ with scale parameter $\sigma$, satisfies the following conditions: 1) for all $u$, $\phi(u) = \phi(-u)$, $\phi(u) \le \phi(0)$, and $\int_{\mathbb{R}}\phi(u)\,du = 1$; 2) $\phi$ is bounded and differentiable with $\|\phi'\|_\infty < \infty$; and 3) $\int_{\mathbb{R}} u^2\phi(u)\,du < \infty$.

Some familiar kernels used for density estimation satisfy the requirements of Assumption 1, e.g., the Gaussian kernel, the sigmoid kernel, and the logistic kernel.

Assumption 2: The conditional density of $\epsilon$ given $x$ is bounded and second-order continuously differentiable, satisfying $\|p''_{\epsilon|X}\|_\infty < \infty$.

Assumption 2 has been used in [17] and is key to bridging $\mathcal{R}(f)$ and $\mathcal{R}^\sigma(f)$.

Since $\mathcal{H}_{K_j}$, $j \in \{1, \ldots, p\}$, are employed to form the additive hypothesis space, it is natural to impose the following restriction on the target function.

Assumption 3: Assume that $f^* = \sum_{j=1}^p f^*_j$ with $f^*_j \in \mathcal{H}_{K_j}$, $\kappa = \sup_{j,u}(K_j(u, u))^{1/2} < \infty$, and the output set $\mathcal{Y} \subset [-M, M]$ for some $M > 0$.

Next, we present the main result on the generalization error.

Theorem 1: Let $c_1 \le \min_{1\le j\le p}\tau_j \le \max_{1\le j\le p}\tau_j \le c_2$ for some positive constants $c_1, c_2$, and let each $K_j \in C^\nu$ with $\|K_j\|_\infty < \infty$ for $j \in \{1, \ldots, p\}$. Under Assumptions 1–3, taking $\sigma = n^{-1/5}$ and $\lambda = n^{-\zeta}$ with
$$0 < \zeta \le \min\Big\{\frac{5 - 3s - \frac{s}{2}\big(1 - \frac{1}{q}\big)}{5s},\ \frac{8 - 5s - 10s\big(1 - \frac{1}{q}\big)}{10s}\Big\}$$
we have, for any $0 < \delta < 1$,
$$\mathcal{R}(f^*) - \mathcal{R}(f_z) \le C n^{-\vartheta}\log(1/\delta)$$
with confidence at least $1 - \delta$, where
$$\vartheta = \min\Big\{\frac{5 - 5s\zeta - 3s - \frac{s}{2}\big(1 - \frac{1}{q}\big)}{10},\ \frac{8 - 5s - 10s\zeta - 10s\big(1 - \frac{1}{q}\big)}{10 + 5s},\ \frac{3 + 5\zeta}{10} - \frac{1}{2q},\ \frac{2}{5}\Big\}$$
with
$$s = \begin{cases} \dfrac{2}{1 + 2\nu}, & \nu \in (0, 1]\\[4pt] \dfrac{2}{1 + \nu}, & \nu \in (1, 3/2]\\[4pt] \dfrac{1}{\nu}, & \nu \in (3/2, \infty) \end{cases} \tag{17}$$
and $C$ is a positive constant independent of $n$ and $\delta$.

Remark 6: Theorem 1 illustrates that the proposed SpMAM can achieve a polynomial decay rate for the excess generalization error. The convergence rate depends heavily on the properties of the component kernels $K_j$, the modal kernel $K_\sigma$, and the tradeoff parameter $\lambda$.

To better understand Theorem 1, we present a refined result under proper parameter selection.

Corollary 1: Let the conditions of Theorem 1 hold. Assume that $K_j \in C^\infty$ for any $j \in \{1, \ldots, p\}$. Taking $\zeta \ge \frac{1}{5} + \frac{1}{q}$, we have
$$\mathcal{R}(f^*) - \mathcal{R}(f_z) \le O\big(n^{-\frac{2}{5}}\log(1/\delta)\big)$$
with confidence at least $1 - \delta$.

Remark 7: Note that the condition $K_j \in C^\infty$ is satisfied by the Gaussian kernel. The derived learning rate of Corollary 1 is consistent with the convergence analysis for KMR [17], where KMR employs a data-independent hypothesis space and does not involve a sparse penalty for variable selection. In particular, the decay rate $O(n^{-2/5})$ is faster than the $O(n^{-1/7})$ rate for RMR [15] under the linearity assumption.
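
As a quick arithmetic check of where the exponent $2/5$ comes from (our own verification using only the expressions in Theorem 1, not an additional result): $K_j \in C^\infty$ allows $\nu \to \infty$, so $s = 1/\nu \to 0$ in (17), and with $\zeta \ge \frac{1}{5} + \frac{1}{q}$ the four terms defining $\vartheta$ satisfy
$$\frac{5 - 5s\zeta - 3s - \frac{s}{2}\big(1 - \frac{1}{q}\big)}{10} \to \frac{1}{2}, \quad \frac{8 - 5s - 10s\zeta - 10s\big(1 - \frac{1}{q}\big)}{10 + 5s} \to \frac{4}{5}, \quad \frac{3 + 5\zeta}{10} - \frac{1}{2q} \ge \frac{4}{10} + \frac{1}{2q} - \frac{1}{2q} = \frac{2}{5}$$
so that $\vartheta = \min\{\frac{1}{2}, \frac{4}{5}, \frac{3+5\zeta}{10} - \frac{1}{2q}, \frac{2}{5}\} = \frac{2}{5}$, matching the rate stated in Corollary 1.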

Remark 8: Corollary 1 describes the consistency between $\mathcal{R}(f^*)$ and $\mathcal{R}(f_z)$. By imposing [17, Assumption 3], we can further obtain a convergence rate on function estimation via the inequality $\|f_z - f^*\|^2_{L^2_{\rho_X}} \le C^*\big(\mathcal{R}(f^*) - \mathcal{R}(f_z)\big)$, where $C^*$ is a constant.

Now, we turn to the variable selection analysis, following the strategy in [54]. Without loss of generality, let $J^* = \{1, \ldots, p^*\}$ be the set of truly informative variables. Recall that $\hat{J}_z$ defined in (16) is the variable set selected by the proposed SpMAM.

Theorem 2: Suppose that $\kappa = \sup_{j,u}(K_j(u, u))^{1/2} < \infty$, $\|\phi'\|_\infty < \infty$, and $\sigma^2 n^{-1/2}\lambda\tau_j \ge \kappa\|\phi'\|_\infty$. Then, for SpMAM with $q = 2$, there holds $\hat{J}_z \subset J^*$ for any $z \in \mathcal{Z}^n$.

Remark 9: Theorem 2 illustrates that SpMAM can identify the truly active variables under proper parameter conditions. Indeed, the parameter conditions used here are milder than the assumptions of [54, Th. 3], where some additional restrictions are required, e.g., Assumption A2 on the probability density and Assumption A3 on $\tau_j$, $1 \le j \le p^*$. On the other hand, the current analysis extends [15, Th. 4] for the linear RMR to the nonlinear SpMAM. Moreover, it would be interesting to explore the variable selection analysis by means of incoherence assumptions (e.g., [47, Assumption 4]) in place of the parameter conditions used here.

Remark 10: Theorems 1 and 2 are proved by developing the error analysis technique for data-dependent hypothesis spaces [31], [32] and the sparse characterization strategy [5], [54]. In particular, the hypothesis error estimation is novel for learning theory: it extends the previous results for the $\ell_1$ (or $\ell_{2,1}$)-regularization scheme with the hinge loss (or the least-squares loss) to the $\ell_{q,1}$-regularization framework with the nonconvex modal regression metric.

To improve the readability of the theoretical analysis, we outline the proofs of Theorems 1 and 2 in Fig. 2.

Fig. 2. Block diagram summarizing the theoretical results for SpMAM. The error decomposition in Proposition 1 supports Propositions 2 and 3. Theorem 1 is obtained by combining the sample error bound in Proposition 2 and the hypothesis error bound in Proposition 3. The property of nonzero coefficients is provided in Lemma 5, which is used to guarantee the variable selection in Theorem 2.

B. Error Analysis

This section establishes the generalization error bound stated in Theorem 1.

Now, we recall an inequality established in [17] that relates $\mathcal{R}(f)$ and $\mathcal{R}^\sigma(f)$.

Lemma 1: Under Assumptions 1 and 2, for any measurable function $f: \mathcal{X}\to\mathbb{R}$, there holds
$$\big|\mathcal{R}^\sigma(f) - \mathcal{R}(f)\big| \le \frac{c_1\sigma^2}{2}$$
where $c_1 = \|p''_{\epsilon|X}\|_\infty\int_{\mathbb{R}} u^2\phi(u)\,du$.

It follows that
$$\mathcal{R}(f^*) - \mathcal{R}(f_z) \le \mathcal{R}^\sigma(f^*) - \mathcal{R}^\sigma(f_z) + c_1\sigma^2.$$
Hence, we further focus on bounding $\mathcal{R}^\sigma(f^*) - \mathcal{R}^\sigma(f_z)$. Inspired by the error analysis in [15], we introduce the expectation version of $f_{z,\eta}$ in (10) as a stepping-stone function, which is data-independent and defined as
$$f_\eta = \arg\max_{f = \sum_{j=1}^p f_j \in \mathcal{H}_K}\Big\{\mathcal{R}^\sigma(f) - \eta\sum_{j=1}^p\tau_j\|f_j\|^2_{K_j}\Big\} \tag{18}$$
where $\mathcal{R}^\sigma(f)$ is defined in (8) and $\eta$ is a regularization parameter that is not necessarily equal to $\lambda$ in (14).

Proposition 1: Under Assumptions 1–3, there holds
$$\mathcal{R}(f^*) - \mathcal{R}(f_z) \le E_1 + E_2 + \eta\|f^*\|^2_K + c_1\sigma^2$$
where
$$E_1 = \mathcal{R}^\sigma(f_\eta) - \mathcal{R}^\sigma(f_z) - \big(\mathcal{R}^\sigma_z(f_\eta) - \mathcal{R}^\sigma_z(f_z)\big)$$
and
$$E_2 = \mathcal{R}^\sigma_z(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K - \big(\mathcal{R}^\sigma_z(f_z) - \lambda\Omega_q(f_z)\big).$$

Proof: Based on Assumption 3, we know that
$$\mathcal{R}^\sigma(f^*) - \mathcal{R}^\sigma(f_z) = \mathcal{R}^\sigma(f^*) - \eta\|f^*\|^2_K - \mathcal{R}^\sigma(f_z) + \eta\|f^*\|^2_K.$$
The definitions of $f_\eta$ in (18) and $f_{z,\eta}$ in (10) imply that
$$\mathcal{R}^\sigma(f^*) - \eta\|f^*\|^2_K \le \mathcal{R}^\sigma(f_\eta) - \eta\|f_\eta\|^2_K$$
and
$$\mathcal{R}^\sigma_z(f_\eta) - \eta\|f_\eta\|^2_K \le \mathcal{R}^\sigma_z(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K.$$
Then, we can make a further decomposition:
$$\begin{aligned}
\mathcal{R}^\sigma(f^*) - \mathcal{R}^\sigma(f_z) &\le \mathcal{R}^\sigma(f_\eta) - \eta\|f_\eta\|^2_K - \mathcal{R}^\sigma(f_z) + \eta\|f^*\|^2_K\\
&\le \mathcal{R}^\sigma(f_\eta) - \mathcal{R}^\sigma_z(f_\eta) + \big\{\mathcal{R}^\sigma_z(f_\eta) - \eta\|f_\eta\|^2_K - \big(\mathcal{R}^\sigma_z(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K\big)\big\}\\
&\quad + \big\{\mathcal{R}^\sigma_z(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K - \big(\mathcal{R}^\sigma_z(f_z) - \lambda\Omega_q(f_z)\big)\big\} + \mathcal{R}^\sigma_z(f_z) - \mathcal{R}^\sigma(f_z) + \eta\|f^*\|^2_K\\
&\le \mathcal{R}^\sigma(f_\eta) - \mathcal{R}^\sigma_z(f_\eta) + \mathcal{R}^\sigma_z(f_z) - \mathcal{R}^\sigma(f_z)\\
&\quad + \big\{\mathcal{R}^\sigma_z(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K - \big(\mathcal{R}^\sigma_z(f_z) - \lambda\Omega_q(f_z)\big)\big\} + \eta\|f^*\|^2_K\\
&= E_1 + E_2 + \eta\|f^*\|^2_K.
\end{aligned}$$
The desired result follows by combining the abovementioned decomposition with Lemma 1. $\blacksquare$

In the learning theory literature, $E_1$ and $E_2$ are called the sample error and the hypothesis error, respectively. The sample error $E_1$, associated with $f_z$, describes the divergence between the empirical risk $\mathcal{R}^\sigma_z(f)$ and the expected risk $\mathcal{R}^\sigma(f)$. The hypothesis error $E_2$ characterizes the difference between the empirical regularized risks over $\mathcal{H}_K$ and $\mathcal{H}_{K,z}$ [32], [55].

To obtain a uniform estimate of $E_1$ over any $z \in \mathcal{Z}^n$, we first prove an upper bound on $\|f_z\|_K$.

Lemma 2: Let $\kappa = \sup_{j,u}(K_j(u, u))^{1/2} < \infty$. For $f_z$ defined in (14), there holds
$$\|f_z\|_K \le \frac{\kappa n^{1-\frac{1}{q}}\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}.$$

Proof: From the definition of $f_z$ and its corresponding coefficients $\{\alpha^z_j\}$, we get $\mathcal{R}^\sigma_z(0) \le \mathcal{R}^\sigma_z(f_z) - \lambda\sum_{j=1}^p\tau_j\|\alpha^z_j\|_q$. Then
$$\lambda\sum_{j=1}^p\tau_j\|\alpha^z_j\|_q \le \mathcal{R}^\sigma_z(f_z) - \mathcal{R}^\sigma_z(0) \le \frac{\|\phi\|_\infty}{\sigma}.$$
This means that $\sum_{j=1}^p\|\alpha^z_j\|_q \le \frac{\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}$. When $q = 1$,
$$\|f_z\|_K \le \kappa\sum_{j=1}^p\|\alpha^z_j\|_1 \le \frac{\kappa\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}. \tag{19}$$
When $q > 1$, we have
$$\|f_z\|_K \le \kappa\sum_{j=1}^p\sum_{i=1}^n|\alpha^z_{ji}| \le \kappa n^{1-\frac{1}{q}}\sum_{j=1}^p\|\alpha^z_j\|_q \le \frac{\kappa n^{1-\frac{1}{q}}\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}. \tag{20}$$
Combining (19) and (20), we get the desired result. $\blacksquare$

Denote $\mathcal{B}_r = \{f \in \mathcal{H}_K : \|f\|_K \le r\}$. Lemma 2 assures that, for all $z \in \mathcal{Z}^n$, $f_z \in \mathcal{B}_r$ with $r = \frac{\kappa n^{1-1/q}\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}$.

Now, we introduce the empirical covering number used in [37], [56], and [57] to measure the capacity of $\mathcal{B}_r$.

Definition 1: Let $\mathcal{F}$ be a set of measurable functions on $\mathcal{X}$ and $\mathbf{x} = \{x_1, x_2, \ldots, x_n\} \subset \mathcal{X}$. The $\ell_2$-empirical metric for $f_1, f_2 \in \mathcal{F}$ is
$$d_{2,\mathbf{x}}(f_1, f_2) = \Big(\frac{1}{n}\sum_{i=1}^n\big(f_1(x_i) - f_2(x_i)\big)^2\Big)^{\frac{1}{2}}.$$
Then, the $\ell_2$-empirical covering number of $\mathcal{F}$ is defined as $\mathcal{N}_2(\mathcal{F}, \varepsilon) = \sup_{n\in\mathbb{N}}\sup_{\mathbf{x}}\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon)$ for all $\varepsilon > 0$, where
$$\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon) = \inf\Big\{m \in \mathbb{N} : \exists\{f_j\}_{j=1}^m \subset \mathcal{F}\ \text{s.t.}\ \mathcal{F} \subset \bigcup_{j=1}^m\big\{f \in \mathcal{F} : d_{2,\mathbf{x}}(f, f_j) < \varepsilon\big\}\Big\}.$$
Indeed, the empirical covering number of $\mathcal{B}_r$ has been investigated extensively in the learning theory literature [37]; detailed examples include [50, Th. 2], [55, Lemma 3], and [58, Examples 1 and 2].

The following concentration inequality, established in [56], is used for our sample error estimation.

Lemma 3: Let $\mathcal{G}$ be a set of measurable functions on $\mathcal{Z}$. Assume that there are constants $B, c, a > 0$ and $\theta \in [0, 1]$ such that $\|g\|_\infty \le B$ and $Eg^2 \le c(Eg)^\theta$ for each $g \in \mathcal{G}$. If, for some $0 < s < 2$, $\log\mathcal{N}_2(\mathcal{G}, \varepsilon) \le a\varepsilon^{-s}$ for all $\varepsilon > 0$, then for any $\delta \in (0, 1)$ and i.i.d. $\{z_i\}_{i=1}^n \subset \mathcal{Z}$, for all $g \in \mathcal{G}$ there holds
$$Eg - \frac{1}{n}\sum_{i=1}^n g(z_i) \le \frac{\gamma^{1-\theta}(Eg)^\theta}{2} + c_s\gamma + 2\Big(\frac{c\log\frac{1}{\delta}}{n}\Big)^{\frac{1}{2-\theta}} + \frac{18B\log\frac{1}{\delta}}{n}$$
with confidence at least $1 - \delta$, where $c_s$ is a constant depending only on $s$ and
$$\gamma = \max\Big\{c^{\frac{2-s}{4-2\theta+s\theta}}\Big(\frac{a}{n}\Big)^{\frac{2}{4-2\theta+s\theta}},\ B^{\frac{2-s}{2+s}}\Big(\frac{a}{n}\Big)^{\frac{2}{2+s}}\Big\}.$$

Proposition 2: Under Assumptions 1–3, for all $\delta \in (0, 1)$, there holds
$$E_1 \le C_1\max\Big\{\lambda^{-\frac{s}{2}}\sigma^{-\frac{3s}{2}}n^{-\frac{1}{2}+\frac{s}{2}(1-\frac{1}{q})},\ \lambda^{-\frac{2s}{2+s}}\sigma^{-\frac{2+5s}{2+s}}n^{-\frac{2}{2+s}+\frac{2s}{2+s}(1-\frac{1}{q})}\Big\} + 2\sqrt{\frac{2\log(1/\delta)}{n\sigma}} + \frac{18\|\phi\|_\infty\log(1/\delta)}{\sigma n}$$
with confidence at least $1 - \delta$, where $C_1$ is a positive constant independent of $\sigma$, $n$, and $\delta$.

Proof: Consider the function-based random variable set
$$\mathcal{G} = \Big\{g(z) := g_f(z) = \frac{1}{\sigma}\Big(\phi\Big(\frac{y - f_\eta(x)}{\sigma}\Big) - \phi\Big(\frac{y - f(x)}{\sigma}\Big)\Big) : f \in \mathcal{B}_r\Big\}$$
where $f_\eta$ is defined in (18) and $r = \frac{\kappa n^{1-1/q}\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}$. For any $f_1, f_2 \in \mathcal{B}_r$ and $z = (x, y) \in \mathcal{Z}$, we get
$$|g_{f_1}(z) - g_{f_2}(z)| \le \frac{\|\phi'\|_\infty}{\sigma^2}|f_1(x) - f_2(x)|.$$
Then
$$\log\mathcal{N}_2(\mathcal{G}, \varepsilon) \le \log\mathcal{N}_2\Big(\mathcal{B}_r, \frac{\varepsilon\sigma^2}{\|\phi'\|_\infty}\Big) \le \log\mathcal{N}_2\Big(\mathcal{B}_1, \frac{\varepsilon\sigma^2}{r\|\phi'\|_\infty}\Big) \le c_s p^{1+s}r^s\sigma^{-2s}\varepsilon^{-s}$$
where the value of $s$ is given in (17) and the last inequality follows from the covering number bounds for $\mathcal{H}_{K_j}$ with $K_j \in C^\nu$ (see [50, Th. 2] or [55, Lemma 3]).

The boundedness of $\phi$ in Assumption 1 implies that $\|g\|_\infty \le \frac{\|\phi\|_\infty}{\sigma}$. Meanwhile, we deduce that
$$Eg^2 \le \frac{\|\phi\|_\infty}{\sigma}\big|\mathcal{R}^\sigma(f_\eta) - \mathcal{R}^\sigma(f)\big| \le \frac{\|\phi\|_\infty}{\sigma}\big|\mathcal{R}(f_\eta) - \mathcal{R}(f) + c_1\sigma^2\big| \le \frac{\|\phi\|_\infty}{\sigma}\big(p_{E_{f_\eta}}(0) - p_{E_f}(0) + c_1\sigma^2\big) \le (c_2\sigma^{-1} + c_3\sigma)(Eg)^0$$
where $c_2 = \|\phi\|_\infty(p_{E_{f_\eta}}(0) - p_{E_f}(0))$ and $c_3 = c_1\|\phi\|_\infty$ are two bounded constants.

Then, Lemma 3 holds for the function set $\mathcal{G}$ with $B = \frac{\|\phi\|_\infty}{\sigma}$, $a = c_s p^{1+s}r^s\sigma^{-2s}$, $\theta = 0$, and $c = c_2\sigma^{-1} + c_3\sigma$. Thus, for any $g \in \mathcal{G}$ and $0 < \delta < 1$, we have, with confidence at least $1 - \delta$,
$$\begin{aligned}
&\mathcal{R}^\sigma(f_\eta) - \mathcal{R}^\sigma(f) - \big(\mathcal{R}^\sigma_z(f_\eta) - \mathcal{R}^\sigma_z(f)\big)\\
&\le \frac{1}{2}\max\Big\{\sqrt{\frac{c_s p^{1+s}r^s\sigma^{-2s}}{n}},\ \Big(\frac{\|\phi\|_\infty}{\sigma}\Big)^{\frac{2-s}{2+s}}\Big(\frac{c_s p^{1+s}r^s\sigma^{-2s}}{n}\Big)^{\frac{2}{2+s}}\Big\} + 2\sqrt{\frac{(\sigma + \sigma^{-1})\log(1/\delta)}{n}} + \frac{18\|\phi\|_\infty\log(1/\delta)}{\sigma n}.
\end{aligned}$$
Recall that, for any $z \in \mathcal{Z}^n$, $f_z \in \mathcal{B}_r$ with $r = \frac{\kappa n^{1-1/q}\|\phi\|_\infty}{\lambda\sigma\min_j\tau_j}$. Plugging this choice of $r$ into the above inequality, we obtain the desired bound on the sample error $E_1$. $\blacksquare$

To bound the hypothesis error $E_2$, we first illustrate a key property of the coefficients of $f_{z,\eta}$.

Lemma 4: For $f_{z,\eta}$ defined in (10), there holds
$$\tau_j\big\|\alpha^{z,\eta}_j\big\|_q \le \frac{\|\phi'\|_\infty}{\eta\sigma^2 n^{1-\frac{1}{q}}} \quad \forall j \in \{1, 2, \ldots, p\}.$$

Proof: Based on the representation of $f_{z,\eta}$ in (11), we get $f_{z,\eta}(x_i) = \sum_{j=1}^p K_{ji}^T\alpha^{z,\eta}_j$, where $K_{ji} = (K_j(x_{1j}, x_{ij}), \ldots, K_j(x_{nj}, x_{ij}))^T \in \mathbb{R}^n$ and $\alpha^{z,\eta}_j = (\alpha^{z,\eta}_{j,1}, \ldots, \alpha^{z,\eta}_{j,n})^T \in \mathbb{R}^n$. From (10), we deduce that
$$\eta\tau_j(\alpha^{z,\eta}_j)^T K_j + \frac{1}{n\sigma^2}\sum_{i=1}^n\phi'\Big(\frac{y_i - f_{z,\eta}(x_i)}{\sigma}\Big)K_{ji}^T = 0$$
where $K_j = (K_j(x_{sj}, x_{tj}))_{s,t=1}^n \in \mathbb{R}^{n\times n}$. By direct computation, we further get
$$\eta\tau_j(\alpha^{z,\eta}_j)^T K_j = -\frac{1}{n\sigma^2}\Big(\phi'\Big(\frac{y_1 - f_{z,\eta}(x_1)}{\sigma}\Big), \ldots, \phi'\Big(\frac{y_n - f_{z,\eta}(x_n)}{\sigma}\Big)\Big)K_j.$$
According to the positive definiteness of $K_j$, we obtain that
$$\tau_j\alpha^{z,\eta}_j = -\frac{1}{n\eta\sigma^2}\Big(\phi'\Big(\frac{y_1 - f_{z,\eta}(x_1)}{\sigma}\Big), \ldots, \phi'\Big(\frac{y_n - f_{z,\eta}(x_n)}{\sigma}\Big)\Big)^T.$$
It follows that, for any $j \in \{1, \ldots, p\}$,
$$\tau_j\big\|\alpha^{z,\eta}_j\big\|_1 = \frac{1}{n\eta\sigma^2}\sum_{t=1}^n\Big|\phi'\Big(\frac{y_t - f_{z,\eta}(x_t)}{\sigma}\Big)\Big| \le \frac{\|\phi'\|_\infty}{\eta\sigma^2}$$
and, when $q > 1$,
$$\tau_j\big\|\alpha^{z,\eta}_j\big\|_q = \frac{1}{n\eta\sigma^2}\Big(\sum_{t=1}^n\Big|\phi'\Big(\frac{y_t - f_{z,\eta}(x_t)}{\sigma}\Big)\Big|^q\Big)^{\frac{1}{q}} \le \frac{\|\phi'\|_\infty}{n^{1-\frac{1}{q}}\eta\sigma^2}.$$
This completes the proof. $\blacksquare$

Proposition 3: The hypothesis error $E_2$ defined in Proposition 1 satisfies $E_2 \le \frac{\lambda\|\phi'\|_\infty}{\eta\sigma^2 n^{1-1/q}}$.

Proof: From the definition of $f_z$ in (14), we know that
$$\mathcal{R}^\sigma_z(f_{z,\eta}) - \lambda\Omega_q(f_{z,\eta}) \le \mathcal{R}^\sigma_z(f_z) - \lambda\Omega_q(f_z).$$
Then
$$\begin{aligned}
E_2 &= \mathcal{R}^\sigma_z(f_{z,\eta}) - \lambda\Omega_q(f_{z,\eta}) - \big(\mathcal{R}^\sigma_z(f_z) - \lambda\Omega_q(f_z)\big) + \lambda\Omega_q(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K\\
&\le \lambda\Omega_q(f_{z,\eta}) - \eta\|f_{z,\eta}\|^2_K \le \lambda\Omega_q(f_{z,\eta}).
\end{aligned}$$
Combining the abovementioned inequality with Lemma 4, we get
$$E_2 \le \lambda\Omega_q(f_{z,\eta}) \le \frac{\lambda\|\phi'\|_\infty}{\eta\sigma^2 n^{1-\frac{1}{q}}}. \qquad\blacksquare$$

Now, we state the proof of Theorem 1.

Proof of Theorem 1: Combining Propositions 1–3, we get, with confidence at least $1 - \delta$,
$$\begin{aligned}
\mathcal{R}(f^*) - \mathcal{R}(f_z) \le C_1\Big(&\max\Big\{\lambda^{-\frac{s}{2}}\sigma^{-\frac{3s}{2}}n^{-\frac{1}{2}+\frac{s}{2}(1-\frac{1}{q})},\ \lambda^{-\frac{2s}{2+s}}\sigma^{-\frac{2+5s}{2+s}}n^{-\frac{2}{2+s}+\frac{2s}{2+s}(1-\frac{1}{q})}\Big\}\\
&+ n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}} + \lambda\eta^{-1}\sigma^{-2}n^{-(1-\frac{1}{q})} + \eta + \sigma^2\Big)\log(1/\delta)
\end{aligned}$$
where $C_1$ is a positive constant independent of $n$, $\sigma$, $\lambda$, and $\eta$.

Setting $n^{-\frac{1}{2}}\sigma^{-\frac{1}{2}} = \sigma^2$ and $\eta = \lambda\eta^{-1}\sigma^{-2}n^{-(1-\frac{1}{q})}$, we get $\sigma = n^{-\frac{1}{5}}$ and $\eta = \lambda^{\frac{1}{2}}n^{-\frac{3}{10}+\frac{1}{2q}}$, and further obtain
$$\mathcal{R}(f^*) - \mathcal{R}(f_z) \le C_2\Big(\max\Big\{\lambda^{-\frac{s}{2}}n^{\frac{3s-5+\frac{s}{2}(1-\frac{1}{q})}{10}},\ \lambda^{-\frac{2s}{2+s}}n^{\frac{5s-8+10s(1-\frac{1}{q})}{10+5s}}\Big\} + \lambda^{\frac{1}{2}}n^{-\frac{3}{10}+\frac{1}{2q}} + n^{-\frac{2}{5}}\Big)\log(1/\delta)$$
with confidence at least $1 - \delta$, where $C_2$ is a positive constant. Finally, we get the desired result by taking $\lambda = n^{-\zeta}$ with $5 - 5s\zeta - 3s - \frac{s}{2}(1 - \frac{1}{q}) \ge 0$ and $8 - 5s - 10s\zeta - 10s(1 - \frac{1}{q}) > 0$. $\blacksquare$

C. Variable Selection Analysis

To prove Theorem 2, we first establish a property of $\{\alpha^z_j\}$ defined in (15).

Lemma 5: For $\{\alpha^z_j\}$ defined in (15) with $q = 2$, there holds
$$\Big\|\frac{1}{n\sigma^2}\sum_{i=1}^n\phi'\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha^z_j}{\sigma}\Big)K_{ji}\Big\|_2 = \lambda\tau_j$$
for any $j \in \{1, \ldots, p\}$ satisfying $\|\alpha^z_j\|_2 \ne 0$.

Proof: From the definition of $\{\alpha^z_j\}$, we know it is the maximizer of the coefficient-based objective
$$C(\alpha) = \frac{1}{n\sigma}\sum_{i=1}^n\phi\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha_j}{\sigma}\Big) - \lambda\sum_{j=1}^p\tau_j\|\alpha_j\|_2.$$
Hence, for $j \in \{1, \ldots, p\}$ satisfying $\|\alpha^z_j\|_2 \ne 0$, there holds
$$\frac{\partial C(\alpha)}{\partial\alpha_j}\Big|_{\alpha=\alpha^z} = -\frac{1}{n\sigma^2}\sum_{i=1}^n\phi'\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha^z_j}{\sigma}\Big)(K_{ji})^T - \lambda\tau_j\frac{(\alpha^z_j)^T}{\|\alpha^z_j\|_2} = 0.$$
This further means that
$$-\frac{1}{n\sigma^2}\sum_{i=1}^n\phi'\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha^z_j}{\sigma}\Big)K_{ji} = \lambda\tau_j\frac{\alpha^z_j}{\|\alpha^z_j\|_2}.$$
By taking the $\ell_2$-norm on both sides, we get the desired result. $\blacksquare$

Lemma 5 gives a necessary condition for a nonzero $\alpha^z_j$, which will be used as a stepping stone for the variable selection analysis.

Following the analysis techniques in [15] and [54], we establish the variable selection guarantee for SpMAM.

Proof of Theorem 2: Suppose that $\|\alpha^z_j\|_2 \ne 0$ for some $j > p^*$. Then, we can deduce that
$$\Big\|\frac{1}{n\sigma^2}\sum_{i=1}^n\phi'\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha^z_j}{\sigma}\Big)K_{ji}\Big\|_2 \le \frac{1}{n\sigma^2}\sum_{i=1}^n\Big\|\phi'\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha^z_j}{\sigma}\Big)K_{ji}\Big\|_2 \le \frac{\|\phi'\|_\infty}{n\sigma^2}\sum_{i=1}^n\|K_{ji}\|_2 \le \frac{\kappa\|\phi'\|_\infty\sqrt{n}}{\sigma^2}.$$
This inequality, together with Lemma 5, guarantees that $\lambda\tau_j \le \frac{\kappa\|\phi'\|_\infty\sqrt{n}}{\sigma^2}$, which contradicts the parameter condition $\sigma^2 n^{-1/2}\lambda\tau_j \ge \kappa\|\phi'\|_\infty$. Therefore, we have $\|\alpha^z_j\|_2 = 0$ for any $j > p^*$. $\blacksquare$

IV. OPTIMIZATION ALGORITHM

Notice that the $\ell_1$-norm regularizer and the $\ell_{2,1}$-norm regularizer have been used extensively and successfully for sparse learning models (e.g., Lasso [33] and SpAM [3]) and group sparse algorithms (e.g., Group Lasso [51] and GroupSpAM [12]), respectively. Hence, we only consider the computing algorithm of SpMAM with $q = 1, 2$ in this section.

The general optimization algorithm for obtaining $f_z$ in (14) and $\hat{J}_z$ in (16) consists of the following two steps.

A. Step 1: Estimating $\alpha^z_j$ via HQ Optimization

We first convert the objective function (15) into a weighted least-squares problem by the HQ optimization [34] and then solve the transformed problem via the ADMM [35].

Indeed, the computing algorithm of SpMAM can be obtained directly from the optimization strategies in [29] and [52] for the Gaussian kernel-based modal representation and in [15] for the Epanechnikov kernel-based modal representation. For completeness, we further present the optimization steps of our SpMAM associated with the sigmoid kernel.

From convex optimization theory [59], there exists a convex function $f(a)$ with conjugate function $g(b)$ such that $f(a) = \max_b(ab - g(b))$. It has been proven in [15] that $\arg\max_b(ab - g(b)) = f'(a)$ for all $a \in \mathbb{R}$. We take the convex function
$$f(a) = \frac{2}{\pi\big(\exp(\sqrt{a}) + \exp(-\sqrt{a})\big)}, \quad a > 0$$
to illustrate the objective function associated with the sigmoid kernel. For the sigmoid representing function $\phi(t) = \frac{2}{\pi(\exp(t) + \exp(-t))}$, there holds
$$\phi(t) = f(t^2) = \max_b\big(t^2 b - g(b)\big), \quad t \in \mathbb{R}. \tag{21}$$
Substituting (21) into the objective function (15), the problem becomes maximizing the augmented objective
$$Q(\alpha, b) = \frac{1}{n\sigma}\sum_{i=1}^n\Big(b_i\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha_j}{\sigma}\Big)^2 - g(b_i)\Big) - \lambda\sum_{j=1}^p\tau_j\|\alpha_j\|_q$$
over $\alpha \in \mathbb{R}^{np}$ and $b \in \mathbb{R}^n$, where $b = (b_1, \ldots, b_n)^T \in \mathbb{R}^n$ is an auxiliary vector that can be updated by
$$b_i = f'\Big(\Big(\frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha_j}{\sigma}\Big)^2\Big) = -\frac{\exp(l_i) - \exp(-l_i)}{\pi l_i\big(\exp(l_i) + \exp(-l_i)\big)^2}$$
where $l_i = (y_i - \sum_{j=1}^p K_{ji}^T\alpha_j)/\sigma$ for a fixed $\alpha$. Once $b$ is fixed, $\alpha$ is updated via

$$\arg\max_{\alpha\in\mathbb{R}^{np}}\Big\{\frac{1}{n\sigma}\sum_{i=1}^n b_i\Big(y_i - \sum_{j=1}^p K_{ji}^T\alpha_j\Big)^2 - \lambda\sum_{j=1}^p\tau_j\|\alpha_j\|_q\Big\}. \tag{22}$$
Clearly, the estimation in (22) can be converted into a least-squares problem. As a result, an alternating maximization framework with respect to $\alpha$ and the weight $b$ can be formed, which satisfies $Q(\alpha^t, b^t) \le Q(\alpha^t, b^{t+1}) \le Q(\alpha^{t+1}, b^{t+1})$. Here, $t$ denotes the $t$th iteration, and the sequence $\{Q(\alpha^t, b^t), t = 1, 2, \ldots\}$ converges [29], [34].
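
A minimal numpy sketch of the HQ weight update for the sigmoid representing function follows (our own illustration of the $b$-update stated above, not the authors' code; handling the small-residual limit $b_i \to -1/(2\pi)$ is an implementation detail not discussed in the text).

import numpy as np

def hq_weights_sigmoid(y, f_x, sigma):
    """HQ auxiliary weights b_i = f'(l_i^2) with l_i = (y_i - f(x_i))/sigma, sigmoid representing function."""
    l = (np.asarray(y) - np.asarray(f_x)) / sigma
    safe_l = np.where(np.abs(l) < 1e-8, 1e-8, l)   # the formula is even in l; its limit at l = 0 is -1/(2*pi)
    b = -(np.exp(safe_l) - np.exp(-safe_l)) / (np.pi * safe_l * (np.exp(safe_l) + np.exp(-safe_l)) ** 2)
    return b                                       # all entries are negative, so diag(-b/sigma) is positive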

We next consider the update of $\alpha$. Denote $K = (K_1, \ldots, K_p) \in \mathbb{R}^{n\times np}$ and $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$, where $K_j = (K_j(x_{sj}, x_{tj}))_{s,t=1}^n \in \mathbb{R}^{n\times n}$. The problem (22) can be reformulated as
$$\arg\min_{\alpha\in\mathbb{R}^{np}}\ (Y - K\alpha)^T\mathrm{diag}\Big(-\frac{b}{\sigma}\Big)(Y - K\alpha) + \lambda\sum_{j=1}^p\tau_j\|\alpha_j\|_q \tag{23}$$
where $\mathrm{diag}(\cdot)$ denotes the operator that converts a vector into a diagonal matrix. Denote $\beta_j = (\beta_{j1}, \beta_{j2}, \ldots, \beta_{jn})^T \in \mathbb{R}^n$ and $\beta = (\beta_1^T, \ldots, \beta_p^T)^T \in \mathbb{R}^{np}$ as an auxiliary vector. Then, (23) can be rewritten as
$$\arg\min_{\alpha,\beta\in\mathbb{R}^{np}}\ (Y - K\alpha)^T\mathrm{diag}\Big(-\frac{b}{\sigma}\Big)(Y - K\alpha) + \lambda\sum_{j=1}^p\tau_j\|\beta_j\|_q \quad \text{s.t.}\quad \alpha - \beta = 0.$$
The scaled augmented Lagrangian function [35] is
$$L(\alpha, \beta, \mu) = (Y - K\alpha)^T\mathrm{diag}\Big(-\frac{b}{\sigma}\Big)(Y - K\alpha) + \lambda\sum_{j=1}^p\tau_j\|\beta_j\|_q + \frac{\varrho}{2}\|\alpha - \beta + \mu\|_2^2 - \frac{\varrho}{2}\|\mu\|_2^2$$
where $\mu$ is the (scaled) Lagrange multiplier and $\varrho > 0$ is a penalty parameter. This problem can be solved by the following iterative scheme:
$$\alpha^{t+1} = \arg\min_{\alpha\in\mathbb{R}^{np}}\Big\{(Y - K\alpha)^T\mathrm{diag}\Big(-\frac{b}{\sigma}\Big)(Y - K\alpha) + \frac{\varrho}{2}\big\|\alpha - \beta^t + \mu^t\big\|_2^2\Big\} \tag{24}$$
$$\beta^{t+1} = \arg\min_{\beta\in\mathbb{R}^{np}}\ \frac{1}{2}\big\|\alpha^{t+1} - \beta + \mu^t\big\|_2^2 + \frac{\lambda}{\varrho}\sum_{j=1}^p\tau_j\|\beta_j\|_q \tag{25}$$
$$\mu^{t+1} = \mu^t + \alpha^{t+1} - \beta^{t+1}. \tag{26}$$

For the $\alpha$-update (24) with $\beta^t$ and $\mu^t$ fixed, the problem is essentially a weighted ridge regression, i.e., a quadratically regularized least-squares problem. By direct calculation, we have
$$\alpha^{t+1} = \Big(2K^T\mathrm{diag}\Big(-\frac{b}{\sigma}\Big)K + \varrho I\Big)^{-1}\Big(2K^T\mathrm{diag}\Big(-\frac{b}{\sigma}\Big)Y + \varrho(\beta^t - \mu^t)\Big). \tag{27}$$
For the $\beta$-update (25) with $\alpha^{t+1}$ and $\mu^t$ fixed, the problem is equivalent to $p$ subproblems
$$\arg\min_{\beta_j\in\mathbb{R}^n}\ \frac{1}{2}\big\|\alpha^{t+1}_j - \beta_j + \mu^t_j\big\|_2^2 + \frac{\lambda\tau_j}{\varrho}\|\beta_j\|_q, \quad \forall j \in \{1, \ldots, p\}. \tag{28}$$
These subproblems can be solved by the soft thresholding operator [35], [59], that is,
$$\beta^{t+1}_j = \mathcal{S}_{\lambda\tau_j/\varrho}\big(\alpha^{t+1}_j + \mu^t_j\big), \quad \forall j \in \{1, \ldots, p\}. \tag{29}$$
For $q = 1$, the soft thresholding operator $\mathcal{S}$ is applied elementwise as
$$\mathcal{S}_k(a) = (a - k)_+ - (-a - k)_+. \tag{30}$$
For $q = 2$, the soft thresholding operator $\mathcal{S}$ is the block shrinkage
$$\mathcal{S}_k(a) = \big(1 - k/\|a\|_2\big)_+\, a. \tag{31}$$
Given $\alpha^{t+1}$ and $\beta^{t+1}$, we update the Lagrange multiplier $\mu^{t+1}$ by (26). If the $(t+1)$th iteration satisfies
$$\big\|\alpha^{t+1} - \beta^{t+1}\big\|_\infty < \varepsilon \quad\text{and}\quad \big\|\alpha^{t+1} - \alpha^t\big\|_\infty < \varepsilon \tag{32}$$
we return $\alpha^{t+1}$ as the final result.
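
The inner ADMM loop of Step 1 can be sketched as follows (a simplified illustration of the updates (24)–(32) under our own conventions, not the authors' implementation; K is the $n \times np$ block matrix $(K_1, \ldots, K_p)$, b holds the HQ weights from the previous step, and the tolerance and iteration cap are assumptions).

import numpy as np

def soft_threshold(a, k, q):
    """Soft thresholding (30) for q = 1 (elementwise) and block shrinkage (31) for q = 2."""
    if q == 1:
        return np.sign(a) * np.maximum(np.abs(a) - k, 0.0)
    norm = np.linalg.norm(a)
    return np.maximum(1.0 - k / norm, 0.0) * a if norm > 0 else a

def admm_alpha_update(K, y, b, sigma, lam, tau, q=2, varrho=0.1, eps=1e-4, max_iter=200):
    """Solve (23) for alpha given HQ weights b, via the iterations (24)-(26)."""
    n, np_total = K.shape
    p = np_total // n
    W = np.diag(-b / sigma)                                    # diag(-b/sigma); positive since b_i < 0
    A = 2.0 * K.T @ W @ K + varrho * np.eye(np_total)
    Ky = 2.0 * K.T @ W @ y
    alpha = np.zeros(np_total)
    beta = np.zeros(np_total)
    mu = np.zeros(np_total)
    for _ in range(max_iter):
        alpha_new = np.linalg.solve(A, Ky + varrho * (beta - mu))      # alpha-update (27)
        for j in range(p):                                             # beta-update (29), block by block
            blk = slice(j * n, (j + 1) * n)
            beta[blk] = soft_threshold(alpha_new[blk] + mu[blk], lam * tau[j] / varrho, q)
        mu = mu + alpha_new - beta                                     # mu-update (26)
        if (np.max(np.abs(alpha_new - beta)) < eps and
                np.max(np.abs(alpha_new - alpha)) < eps):              # stopping rule (32)
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha.reshape(p, n)                                         # row j holds alpha_j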

B. Step 2: Screening Out Active Variables

Here, we adopt the stability-based selection strategy in [53] to obtain a stable result for estimation and variable selection.

First, the training samples are randomly divided into two subsets. For a given threshold $v_n$, two active variable sets $\hat{J}_z^{1k}$ and $\hat{J}_z^{2k}$ are identified, one for each subset, under the $k$th splitting.

Then, the threshold $v_n$ is chosen by maximizing the quantity
$$\frac{1}{T}\sum_{k=1}^T\kappa\big(\hat{J}_z^{1k}, \hat{J}_z^{2k}\big) \tag{33}$$
where $T$ is the number of partitions in the data-adaptive scheme and $\kappa(\cdot, \cdot)$ is Cohen's kappa statistic between the two active sets [53].
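
The threshold search in (33) can be sketched as follows (our own minimal illustration, not the authors' code; it assumes the two active sets of each split are encoded as boolean selection vectors over the $p$ variables and uses the standard two-rater form of Cohen's kappa).

import numpy as np

def cohen_kappa(sel_a, sel_b):
    """Cohen's kappa between two boolean selection vectors over the p variables."""
    sel_a, sel_b = np.asarray(sel_a, bool), np.asarray(sel_b, bool)
    p_o = np.mean(sel_a == sel_b)                                  # observed agreement
    p_e = sel_a.mean() * sel_b.mean() + (1 - sel_a.mean()) * (1 - sel_b.mean())  # chance agreement
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

def select_threshold(alpha_norm_pairs, grid):
    """Pick v_n maximizing the average kappa over the T splits, cf. (33).
    alpha_norm_pairs: list of (||alpha_j||_q from subset 1, ||alpha_j||_q from subset 2), each length p."""
    scores = [np.mean([cohen_kappa(a1 > v, a2 > v) for a1, a2 in alpha_norm_pairs]) for v in grid]
    return grid[int(np.argmax(scores))]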

The computing steps of SpMAM are summarized in Algorithm 1.

Algorithm 1 Optimization Algorithm of SpMAM

Input: Samples $z$, $\phi$, $\sigma$, $\lambda$, $h$, $K_j$, $\tau_j$ ($j = 1, \ldots, p$), $\varepsilon$, Maxiter.
Initialization: $t = 0$; $\alpha^t \in \mathbb{R}^{np}$ drawn from the uniform distribution $U(0, 1)$.
While not converged and $t \le$ Maxiter:
  1. Fix $\alpha^t$ and update the weights $b^{t+1} = (b^{t+1}_1, \ldots, b^{t+1}_n)^T$ via
     $b^{t+1}_i = -\frac{\exp(l_i) - \exp(-l_i)}{\pi l_i(\exp(l_i) + \exp(-l_i))^2}$, where $l_i = \frac{y_i - \sum_{j=1}^p K_{ji}^T\alpha^t_j}{\sigma}$, $1 \le i \le n$.
  2. Fix $b^{t+1}$ and update the coefficients $\alpha^{t+1}$ using ADMM:
     Initialization: $t^* = 0$, $\alpha^{t^*} = 0$, $\beta^{t^*} = 0$, $\mu^{t^*} = 0$, $\varrho = 10^{-1}$.
     While not converged and $t^* \le$ Maxiter:
       2.1. With $\beta^{t^*}$ and $\mu^{t^*}$ fixed, update $\alpha^{t^*+1}$ by (27).
       2.2. With $\alpha^{t^*+1}$ and $\mu^{t^*}$ fixed, update $\beta^{t^*+1}$ by (29), using the soft thresholding operator (30) for $q = 1$ or (31) for $q = 2$.
       2.3. With $\alpha^{t^*+1}$ and $\beta^{t^*+1}$ fixed, update $\mu^{t^*+1}$ by (26).
       2.4. Check the convergence condition (32).
       2.5. $t^* \leftarrow t^* + 1$.
     End While
  3. $\alpha^{t+1} = \alpha^{t^*}$.
  4. Check the convergence condition $Q(\alpha^{t+1}, b^{t+1}) - Q(\alpha^t, b^t) < \varepsilon$.
  5. $t \leftarrow t + 1$.
End While
Set $\alpha^z_j = \alpha^t_j$, $j = 1, \ldots, p$.
Variable Selection: The identified index set is $\hat{J}_z = \{j : \|\alpha^z_j\|_q > v_n\}$, where $v_n$ is searched by maximizing (33).
Output: $\hat{J}_z$ and $f_z$ associated with $\alpha^z_j$, $j \in \hat{J}_z$.

V. EMPIRICAL ASSESSMENT

This section evaluates the empirical performance of SpMAM with different modal kernels (Gaussian kernel $K_\sigma(t)_{\mathrm{Gau}} = e^{-t^2/2\sigma^2}$ and sigmoid kernel $K_\sigma(t)_{\mathrm{Sig}} = \frac{2}{\pi(e^{t/\sigma} + e^{-t/\sigma})}$) and regularizers ($\ell_1$-norm and $\ell_{2,1}$-norm regularizer). We refer to these methods as SpMAM$_{\mathrm{Gau},q}$ and SpMAM$_{\mathrm{Sig},q}$ according to the modal kernel and regularizer used. In all cases, the Gaussian kernel $K_h(x, t) = e^{-\|x-t\|_2^2/2h^2}$ is employed to construct the data-dependent hypothesis space $\mathcal{H}_{K,z}$, and the $j$th weight is set to $\tau_j = 1$ for each kernel norm $\|f_j\|^2_{K_j}$, $j = 1, \ldots, p$. For comparison, we consider three methods: SpAM [3], RMR [15], and Lasso [33].

A. Evaluation Metrics and Parameter Tuning

For the synthetic data, the average squared error (ASE) isused to evaluate the prediction results, which is defined asASE = (1)/(n)

∑ni=1( f ∗(xi) − fz(xi))

2. It should be noticedthat the error measure indicates the truly divergence betweenf ∗(xi) and fz(xi), which is independently of the data noiseand different from the empirical risk, i.e., 1

n

∑ni=1(yi− fz(xi))

2.Six kinds of metrics are used to measure the behavior of

variable selection: size (the average number of selected vari-ables), TP (the number of selected truly informative variables),FP (the number of selected truly uninformative variables), andC, U, and O (the probability of correct fitting, underfitting, andoverfitting, respectively).

For each synthetic experiment, we use the holdout method to split the data set into three parts of equal sample size: a training set, a validation set, and a test set. On the training set, we train the model and search the threshold $v_n$ in $\{10^{-2+0.05t}\}_{t=1}^{60}$ based on the stability-based selection criterion [53]. The validation set is used to select the optimal regularization parameter $\lambda$, bandwidth $h$, and bandwidth $\sigma$ from the grids $\{0.05t\}_{t=1}^{200}$, $\{10^{-3}, 5\times 10^{-3}, 10^{-2}, \ldots, 1, 5, 10\}$, and $\{0.5t\}_{t=1}^{20}$, respectively. With these optimal hyperparameters, we finally evaluate the prediction results on the test set.

For the real-world data, we randomly take 10% of the data set as the test set and the remainder as the training set. We employ fivefold cross-validation on the training set to select the hyperparameters and evaluate the performance on the test set. Since the true active features are unknown, we measure the prediction error by the relative sum of squared errors (RSSE), defined as

$$\mathrm{RSSE} = \sum_{(x,y)\in X_{\mathrm{test}}} \big(y - f_z(x)\big)^2 \Big/ \sum_{(x,y)\in X_{\mathrm{test}}} (y - Ey)^2$$

where $f_z$ is the estimator of SpMAM, $y$ is the output of $x$ in the test set $X_{\mathrm{test}}$, and $Ey$ denotes the average value of the outputs over $X_{\mathrm{test}}$.
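Both error measures follow directly from their definitions; the short sketch below computes ASE against the noise-free target and RSSE on a held-out test set (the function names are ours).

```python
# Sketch of the two prediction-error measures described above.
import numpy as np

def average_squared_error(f_true, f_hat):
    """ASE = (1/n) * sum_i (f*(x_i) - f_z(x_i))^2, compared against the noise-free target."""
    f_true, f_hat = np.asarray(f_true), np.asarray(f_hat)
    return np.mean((f_true - f_hat) ** 2)

def relative_sse(y_test, y_pred):
    """RSSE = sum (y - f_z(x))^2 / sum (y - mean(y))^2 over the test set."""
    y_test, y_pred = np.asarray(y_test), np.asarray(y_pred)
    return np.sum((y_test - y_pred) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
```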

B. Synthetic Data Evaluation

Following the simulations in [3] and [12], two examples (Example A and Example B) are constructed. The training set and the validation set are independently generated by $y_i = f^*(x_i) + \epsilon_i/2$, and the test set is generated by $y_i = f^*(x_i)$ without added noise, where the $p$-dimensional input $x_i = (x_{i1}, \ldots, x_{ip})^T$, $i = 1, \ldots, n$, is drawn independently from a uniform distribution $U(a, b)$. In Example A, the uniform distribution is $U(-1, 1)$ and $f^*(x)$ is an additive function formed by eight component functions. In Example B, the uniform distribution is $U(-0.5, 0.5)$ and $f^*(x)$ is constructed from four component functions. The random noise $\epsilon_i$ includes Gaussian noise, chi-square noise, and Student's t noise. Data details are presented in Table III.

TABLE III
ADDITIVE FUNCTIONS IN EXAMPLE A AND EXAMPLE B
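Table III lists the actual component functions of Example A and Example B, which are not reproduced here; the sketch below therefore uses hypothetical component functions purely to illustrate the data-generating protocol described above (uniform inputs, noise added as $\epsilon_i/2$ to the training and validation outputs, and a noise-free test target). The centering of the chi-square noise and the degrees of freedom of the Student's t noise are also assumptions.

```python
# Sketch of the synthetic-data protocol of Section V-B with HYPOTHETICAL component
# functions (the actual ones are in Table III); noise options follow the text.
import numpy as np

def make_example(n, p, low=-1.0, high=1.0, noise="gaussian", seed=None):
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(n, p))          # requires p >= 4 for the placeholder target
    # placeholder additive target acting on the first few coordinates only
    f_star = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + np.exp(-X[:, 2]) + X[:, 3]
    if noise == "gaussian":
        eps = rng.normal(0.0, 1.0, n)
    elif noise == "chisq":
        eps = rng.chisquare(df=2, size=n) - 2.0      # centered chi-square (centering assumed)
    else:
        eps = rng.standard_t(df=3, size=n)           # degrees of freedom assumed
    y_train = f_star + eps / 2.0                     # training/validation: y = f*(x) + eps/2
    y_test = f_star                                  # test set: noise-free target
    return X, y_train, y_test
```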

For each simulation, we consider scenarios with $(n, p) = (200, 100)$, $(200, 200)$, and $(200, 400)$. Each scenario is repeated 100 times to reduce the influence of random sampling, and the averaged results are summarized in Table IV. In Section I.A of the Supplementary Material, we also investigate the impact of the sample size on ASE and the correct-fitting rate. These empirical results show that the proposed SpMAM achieves the lowest prediction error and the best variable selection ability.

TABLE IV
AVERAGED PERFORMANCE ON SYNTHETIC DATA IN EXAMPLE A (LEFT) AND EXAMPLE B (RIGHT) UNDER THREE TYPES OF NOISES

Moreover, we introduce the sample coverage probability to measure algorithmic robustness, i.e., the proportion of samples that fall within regression intervals centered around the estimated regression function [15], [19]. Three widths of symmetric intervals, $0.1v$, $0.2v$, and $0.3v$, are considered, where $v$ denotes the standard deviation of the random noise. For brevity, Table V reports only the average results over 100 runs for $(n, p) = (200, 100)$ and SpMAM with the Gaussian modal kernel. For non-Gaussian noise, the coverage probabilities of SpMAM are higher than those of its counterparts, which demonstrates that our prediction curves lie in the high-density regions. Besides the continuous noise employed earlier, SpMAM also performs robustly under point pollution, e.g., when the noise follows a continuous-discrete mixture distribution. Owing to space limitations, empirical evaluations for this case are given in Section I.B of the Supplementary Material. In summary, the experimental results show that SpMAM consistently outperforms the other methods in both Example A and Example B with respect to prediction error, variable selection ability, and robustness.
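As a rough illustration of this robustness measure, the sketch below computes the fraction of samples whose outputs fall inside a symmetric band around the fitted regression values. Whether the quoted widths $0.1v$, $0.2v$, and $0.3v$ denote the full band width (halved on each side, as assumed here) or the half-width is not stated, so the halving convention is ours.

```python
# Sketch of the sample coverage probability; the band convention is an assumption.
import numpy as np

def coverage_probability(y, y_pred, noise_std, width_factor):
    """Fraction of samples whose outputs fall inside the band of total width
    width_factor * noise_std centered at the fitted regression value."""
    half = 0.5 * width_factor * noise_std
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_pred)) <= half)

# Example: coverage for the three widths 0.1v, 0.2v, and 0.3v reported in Table V.
# covs = [coverage_probability(y, f_hat, v, c) for c in (0.1, 0.2, 0.3)]
```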

C. Real World Data Evaluation

Now, we evaluate the proposed SpMAM and the related methods on the Boston Housing data from UCI, the Plasma Retinol data from StatLib, and the Ozone data from CRAN. These data sets have been used to evaluate variable selection methods in [3] and [54]. Since the real data contain only a limited number of features, 20 irrelevant variables, generated from the uniform distribution $U(-0.5, 0.5)$, are added to each data set.

The Boston Housing data contain 506 observations with 13 variables: CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, and LSTAT. The Plasma Retinol data were collected to establish a relationship between the plasma concentration of micronutrients and personal characteristics; they contain 315 observations with one output variable, RETPLASMA, and 12 input variables: AGE, SEX, SMOK, QUET, VIT, CAL, FAT, FIBER, ALCOHOL, CHOLES, BETA, and RET. The Ozone data aim to predict the upland ozone concentration (UPO3), which is influenced by DM, DW, M, VDHT, WDSP, HMDT, SBTP, IBTP, IBHT, DGPG, and VSTY.

TABLE V
AVERAGE COVERAGE PROBABILITIES ON SIMULATED DATA (n, p) = (200, 100)

TABLE VI
RESULTS OF VARIABLE SELECTION AND RSSE (STD) ON BOSTON HOUSING, PLASMA RETINOL, AND OZONE DATA

We present the results of variable selection, the RSSE, and its standard error in Table VI. For the Boston Housing data, the proposed SpMAM usually selects only three variables yet achieves a smaller RSSE than the other approaches. For the Plasma Retinol data, the variables selected by the four methods differ, except that the variable AGE is selected by all of them. In particular, ALCOHOL, which contains one leverage point, is selected by SpAM, SpMAM$_{\mathrm{Gau},1}$, SpMAM$_{\mathrm{Gau},2}$, and SpMAM$_{\mathrm{Sig},2}$; however, SpAM obtains a larger prediction error than our method. For the Ozone data, M, SBTP, and IBTP are identified as the most important variables in [60] and are selected by all of the abovementioned methods except Lasso and RMR. Moreover, HMDT is ignored by SpMAM but chosen by all competitors. Given the competitive performance of SpMAM, this suggests that HMDT may not be a key factor for prediction.

In order to verify that SpMAM still works for high-dimensional data sets, we repeat the abovementioned experiments with 500 irrelevant variables in place of 20 and, owing to space limitations, present the average results in Section II of the Supplementary Material.

VI. CONCLUSION

This article formulated a new sparse additive regression model by integrating modal regression, sparse representation, and a data-dependent hypothesis space. The proposed model is robust to non-Gaussian noise and provides interpretable prediction results. In theory, we established asymptotic guarantees on regression estimation and variable selection, where the derived learning rate is faster than that of the linear RMR [15]. In applications, the model has shown strong empirical performance on both simulated and real-world data sets.

ACKNOWLEDGMENT

This work was done in part when Hong Chen visited the Feng Zheng Lab, SUSTech.

REFERENCES

[1] C. J. Stone, "Additive regression and other nonparametric models," Ann. Statist., vol. 13, no. 2, pp. 689–705, Jun. 1985.
[2] T. J. Hastie and R. J. Tibshirani, Generalized Additive Models. London, U.K.: Chapman & Hall, 1990.
[3] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman, "SpAM: Sparse additive models," J. Roy. Stat. Soc. B, vol. 71, pp. 1009–1030, 2009.
[4] T. Zhao and H. Liu, "Sparse additive machine," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2012, pp. 1435–1443.
[5] H. Chen, X. Wang, C. Deng, and H. Huang, "Group sparse additive machine," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 198–208.
[6] K. Kandasamy and Y. Yu, "Additive approximations in high dimensional nonparametric regression via the SALSA," in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 69–78.
[7] M. Kohler and A. Krzyzak, "Nonparametric regression based on hierarchical interaction models," IEEE Trans. Inf. Theory, vol. 63, no. 3, pp. 1620–1630, Mar. 2017.
[8] G. Raskutti, M. J. Wainwright, and B. Yu, "Minimax-optimal rates for sparse additive models over kernel classes via convex programming," J. Mach. Learn. Res., vol. 13, no. 1, pp. 389–427, 2012.
[9] A. Christmann and D.-X. Zhou, "Learning rates for the risk of kernel-based quantile regression estimators in additive models," Anal. Appl., vol. 14, no. 3, pp. 449–477, May 2016.
[10] L. Meier, S. van de Geer, and P. Bühlmann, "High-dimensional additive modeling," Ann. Statist., vol. 37, no. 6B, pp. 3779–3821, Dec. 2009.
[11] J. Huang, J. L. Horowitz, and F. Wei, "Variable selection in nonparametric additive models," Ann. Statist., vol. 38, no. 4, pp. 2282–2313, Aug. 2010.
[12] J. Yin, X. Chen, and E. P. Xing, "Group sparse additive models," in Proc. Int. Conf. Mach. Learn. (ICML), 2012, p. 871.
[13] Y. Lin and H. H. Zhang, "Component selection and smoothing in multivariate nonparametric regression," Ann. Statist., vol. 34, no. 5, pp. 2272–2297, Oct. 2006.
[14] M. Yuan and D.-X. Zhou, "Minimax optimal rates of estimation in high dimensional additive models," Ann. Statist., vol. 44, no. 6, pp. 2564–2593, Dec. 2016.
[15] X. Wang, H. Chen, W. Cai, D. Shen, and H. Huang, "Regularized modal regression with applications in cognitive impairment prediction," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 1448–1458.
[16] W. Yao and L. Li, "A new regression model: Modal linear regression," Scandin. J. Statist., vol. 41, no. 3, pp. 656–671, Sep. 2014.
[17] Y. Feng, J. Fan, and J. A. K. Suykens, "A statistical learning approach to modal regression," J. Mach. Learn. Res., vol. 21, no. 1, pp. 1–35, 2020.
[18] G. Collomb, W. Härdle, and S. Hassani, "A note on prediction via estimation of the conditional mode function," J. Stat. Planning Inference, vol. 15, pp. 227–236, Jan. 1986.
[19] W. Yao, B. G. Lindsay, and R. Li, "Local modal regression," J. Nonparametric Statist., vol. 24, no. 3, pp. 647–663, Sep. 2012.
[20] Y.-C. Chen, C. R. Genovese, R. J. Tibshirani, and L. Wasserman, "Nonparametric modal regression," Ann. Statist., vol. 44, no. 2, pp. 489–514, Apr. 2016.
[21] M.-J. Lee, "Mode regression," J. Econometrics, vol. 42, no. 3, pp. 337–349, Nov. 1989.
[22] T. W. Sager and R. A. Thisted, "Maximum likelihood estimation of isotonic modal regression," Ann. Statist., vol. 10, no. 3, pp. 690–707, Sep. 1982.
[23] C. Abraham, G. Biau, and B. Cadre, "Simple estimation of the mode of a multivariate density," Can. J. Statist., vol. 31, no. 1, pp. 23–34, Mar. 2003.
[24] E. Matzner-Løfber, A. Gannoun, and J. G. De Gooijer, "Nonparametric forecasting: A comparison of three kernel-based methods," Commun. Statist. Theory Methods, vol. 27, no. 7, pp. 1593–1617, Jan. 1998.
[25] J. Einbeck and G. Tutz, "Modelling beyond regression functions: An application of multimodal regression to speed–flow data," J. Roy. Stat. Soc. C, Appl. Statist., vol. 55, no. 4, pp. 461–475, Aug. 2006.
[26] J. Li, S. Ray, and B. G. Lindsay, "A nonparametric statistical approach to clustering via mode identification," J. Mach. Learn. Res., vol. 8, pp. 1687–1723, Aug. 2007.
[27] W. Liu, P. P. Pokharel, and J. C. Príncipe, "Correntropy: Properties and applications in non-Gaussian signal processing," IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5286–5298, Nov. 2007.
[28] Y. Feng, X. Huang, L. Shi, Y. Yang, and J. A. K. Suykens, "Learning with the maximum correntropy criterion induced losses for regression," J. Mach. Learn. Res., vol. 16, pp. 993–1034, 2015.
[29] R. He, W.-S. Zheng, and B.-G. Hu, "Maximum correntropy criterion for robust face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1561–1576, Aug. 2011.
[30] V. Roth, "The generalized LASSO," IEEE Trans. Neural Netw., vol. 15, no. 1, pp. 16–28, Jan. 2004.
[31] Q. Wu and D.-X. Zhou, "Learning with sample dependent hypothesis spaces," Comput. Math. Appl., vol. 56, no. 11, pp. 2896–2907, Dec. 2008.
[32] H. Chen, H. Xia, W. Cai, and H. Huang, "Error analysis of generalized Nyström kernel regression," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 2541–2549.
[33] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. B, Methodol., vol. 58, no. 1, pp. 267–288, Jan. 1996.
[34] M. Nikolova and M. K. Ng, "Analysis of half-quadratic minimization methods for signal and image recovery," SIAM J. Sci. Comput., vol. 27, no. 3, pp. 937–966, Jan. 2005.
[35] S. Boyd, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2010.
[36] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint. Cambridge, U.K.: Cambridge Univ. Press, 2007.
[37] I. Steinwart and A. Christmann, Support Vector Machines. New York, NY, USA: Springer-Verlag, 2008.
[38] Y. Chen, "A tutorial on kernel density estimation and recent advances," Biostatist. Epidemiol., vol. 1, no. 1, pp. 161–187, 2017.
[39] T. W. Sager, "Estimation of a multivariate mode," Ann. Statist., vol. 6, no. 4, pp. 802–812, Jul. 1978.
[40] Y.-C. Chen, "Modal regression using kernel density estimation: A review," Wiley Interdiscipl. Rev. Comput. Statist., vol. 10, no. 4, p. e1431, Jul. 2018.
[41] H. Chernoff, "Estimation of the mode," Ann. Inst. Stat. Math., vol. 16, no. 1, pp. 31–41, 1964.
[42] T. Dalenius, "The mode—A neglected statistical parameter," J. Roy. Stat. Soc., vol. 128, no. 1, pp. 110–117, 1965.
[43] J. H. Venter, "On estimation of the mode," Ann. Math. Statist., vol. 38, no. 5, pp. 1446–1455, 1967.
[44] C. Heinrich, "The mode functional is not elicitable," Biometrika, vol. 101, no. 1, pp. 245–251, Mar. 2014.
[45] H. Chen and Y. Wang, "Kernel-based sparse regression with the correntropy-induced loss," Appl. Comput. Harmon. Anal., vol. 44, no. 1, pp. 144–164, Jan. 2018.
[46] P. Huber and E. Ronchetti, Robust Statistics. Hoboken, NJ, USA: Wiley, 2009.
[47] S. Lv, H. Lin, H. Lian, and J. Huang, "Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space," Ann. Statist., vol. 46, no. 2, pp. 781–813, Apr. 2018.
[48] G. Wahba, Spline Models for Observational Data. Philadelphia, PA, USA: SIAM, 1990.
[49] B. Schölkopf and A. J. Smola, Learning With Kernels. Cambridge, MA, USA: MIT Press, 2002.
[50] L. Shi, Y.-L. Feng, and D.-X. Zhou, "Concentration estimates for learning with $\ell_1$-regularizer and data dependent hypothesis spaces," Appl. Comput. Harmon. Anal., vol. 31, no. 2, pp. 286–302, 2011.
[51] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Stat. Soc. B, Stat. Methodol., vol. 68, no. 1, pp. 49–67, Feb. 2006.
[52] Y. Wang, Y. Y. Tang, and L. Li, "Correntropy matching pursuit with application to robust digit and face recognition," IEEE Trans. Cybern., vol. 47, no. 6, pp. 1354–1366, Jun. 2017.
[53] W. Sun, J. Wang, and Y. Fang, "Consistent selection of tuning parameters via variable selection stability," J. Mach. Learn. Res., vol. 14, no. 9, pp. 3419–3440, 2013.
[54] L. Yang, S. Lv, and J. Wang, "Model-free variable selection in reproducing kernel Hilbert space," J. Mach. Learn. Res., vol. 17, pp. 1–24, May 2016.
[55] L. Shi, "Learning theory estimates for coefficient-based regularized regression," Appl. Comput. Harmon. Anal., vol. 34, no. 2, pp. 252–265, Mar. 2013.
[56] Q. Wu, Y. Ying, and D.-X. Zhou, "Multi-kernel regularized classifiers," J. Complex., vol. 23, no. 1, pp. 108–134, Feb. 2007.
[57] B. Zou, C. Xu, Y. Lu, Y. Y. Tang, J. Xu, and X. You, "k-Times Markov sampling for SVMC," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 1328–1341, Apr. 2018.
[58] Z.-C. Guo and D.-X. Zhou, "Concentration estimates for learning with unbounded sampling," Adv. Comput. Math., vol. 38, no. 1, pp. 207–223, Jan. 2013.
[59] R. T. Rockafellar, Convex Analysis. Princeton, NJ, USA: Princeton Univ. Press, 1997.
[60] G. E. Fasshauer, F. J. Hickernell, and H. Wozniakowski, "On dimension-independent rates of convergence for function approximation with Gaussian kernels," SIAM J. Numer. Anal., vol. 50, no. 1, pp. 247–271, Jan. 2012.

Hong Chen received the B.S., M.S., and Ph.D. degrees from Hubei University, Wuhan, China, in 2003, 2006, and 2009, respectively.

From February 2016 to August 2017, he was a Post-Doctoral Researcher with the Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX, USA. He is currently a Professor with the Department of Mathematics and Statistics, College of Science, Huazhong Agricultural University, Wuhan. His current research interests include machine learning, statistical learning theory, and approximation theory.

Yingjie Wang received the B.S. degree from Huazhong Agricultural University, Wuhan, China, in 2017, where he is currently pursuing the Ph.D. degree with the Department of Agricultural Information Engineering, College of Informatics.

His current research interests include interpretable machine learning and learning theory.

Feng Zheng (Member, IEEE) received the Ph.D. degree from The University of Sheffield, Sheffield, U.K., in 2017.

He is currently an Assistant Professor with the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include machine learning, computer vision, and human-computer interaction.

Cheng Deng (Senior Member, IEEE) received the B.E., M.S., and Ph.D. degrees in signal and information processing from Xidian University, Xi'an, China, in 2001, 2006, and 2009, respectively.

He is currently a Full Professor with the School of Electronic Engineering, Xidian University. He is the author or a coauthor of more than 100 scientific articles at top venues, including the IEEE TNNLS, the IEEE TRANSACTIONS ON IMAGE PROCESSING (TIP), the IEEE TRANSACTIONS ON CYBERNETICS (TCYB), the IEEE TRANSACTIONS ON MULTIMEDIA (TMM), the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS (TSMC), the International Conference on Computer Vision (ICCV), the Conference on Computer Vision and Pattern Recognition (CVPR), the International Conference on Machine Learning (ICML), the Conference on Neural Information Processing Systems (NIPS), the International Joint Conference on Artificial Intelligence (IJCAI), and the Association for the Advancement of Artificial Intelligence (AAAI). His research interests include computer vision, pattern recognition, and information hiding.

Heng Huang received the B.S. and M.S. degrees from Shanghai Jiao Tong University, Shanghai, China, in 1997 and 2001, respectively, and the Ph.D. degree in computer science from Dartmouth College, Hanover, NH, USA, in 2006.

He is currently a John A. Jurenko Endowed Professor of computer engineering with the Electrical and Computer Engineering Department, University of Pittsburgh, Pittsburgh, PA, USA. He is also a Senior Consulting Researcher with JD Finance America Corporation, Mountain View, CA, USA. His research interests include machine learning, big data mining, bioinformatics, and neuroinformatics.


