

Learning Data-adaptive Nonparametric Kernels

Fanghui Liu LFHSGRE@OUTLOOK.COM
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, 200240, China
Department of Electrical Engineering, ESAT-STADIUS, KU Leuven, Leuven, B-3001, Belgium

Xiaolin Huang XIAOLINHUANG@SJTU.EDU.CN
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, 200240, China

Chen Gong CHEN.GONG@NJUST.EDU.CN
Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China

Jie Yang∗ JIEYANG@SJTU.EDU.CN
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, 200240, China

Li Li LI-LI@MAIL.TSINGHUA.EDU.CN
Department of Automation, Tsinghua University, Beijing, 100084, China

Editor: xxx

Abstract

Kernel methods have been extensively used in nonlinear learning. For complicated practical tasks, the traditional kernels, e.g., the Gaussian kernel and sigmoid kernel, or their combinations are often not sufficiently flexible to fit the data. In this paper, we present a Data-Adaptive Nonparametric Kernel (DANK) learning framework in a data-driven manner. To be specific, in model formulation, we impose an adaptive matrix on the kernel/Gram matrix in an entry-wise strategy. Since we do not specify the formulation of the adaptive matrix, each entry in it can be directly and flexibly learned from the data. Therefore, the solution space of the learned kernel is largely expanded, which makes our DANK model flexible to fit the data. Specifically, the proposed kernel learning framework can be seamlessly embedded into support vector machines (SVM) and support vector regression (SVR), and has the capability of enlarging the margin between classes and reducing the model generalization error. Theoretically, we demonstrate that the objective function of our DANK model embedded in SVM/SVR is gradient-Lipschitz continuous. Thereby, the training process for kernel and parameter learning in SVM/SVR can be efficiently optimized in a unified framework. Further, to address the scalability issue in the nonparametric kernel learning framework, we decompose the entire optimization problem in DANK into several smaller, easy-to-solve problems, so that our DANK model can be efficiently approximated by this partition. The effectiveness of this approximation is demonstrated by both empirical studies and theoretical guarantees. Experimentally, the proposed DANK model embedded in SVM/SVR achieves encouraging performance on various classification and regression benchmark datasets when compared with other representative kernel learning based algorithms.

∗. Jie Yang and Xiaolin Huang are corresponding authors.

arXiv:1808.10724v2 [cs.LG] 28 Mar 2020



Keywords: nonparametric kernel learning, SVM, gradient-Lipschitz continuous

1. Introduction

Kernel methods (Scholkopf and Smola, 2003; Shawe-Taylor and Cristianini, 2000; Steinwart and Andreas, 2008) have proven to be powerful in a variety of machine learning tasks, e.g., classification (Vapnik, 1995), regression (Drucker et al., 1997), clustering (Dhillon et al., 2004), dynamical systems (Boots et al., 2013), and causal inference (Mitrovic et al., 2018). They employ a so-called kernel function $k(\cdot,\cdot): \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ to compute the similarity between any two samples $x_i, x_j\in\mathbb{R}^d$ such that

$$k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle_{\mathcal{H}},$$

where $\phi: \mathcal{X}\to\mathcal{H}$ is a non-linear feature map transforming elements of the input space $\mathcal{X}$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. In practical use, there is no need to acquire the explicit formulation of this mapping. Instead, one only needs the inner product of two vectors in the feature space, computed directly by the kernel, a.k.a. the "kernel trick".
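To make the kernel trick concrete, here is a minimal NumPy sketch (our illustration, not part of the original paper) that evaluates the Gaussian kernel used later, $k(x_i,x_j) = \exp(-\|x_i - x_j\|_2^2/2\sigma^2)$, so inner products in $\mathcal{H}$ are obtained without ever forming $\phi$ explicitly.

```python
import numpy as np

def gaussian_gram(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 * sigma^2)).

    The inner product <phi(x_i), phi(z_j)> in the RKHS is obtained
    without constructing the (infinite-dimensional) feature map phi.
    """
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# toy usage: a 5x5 PSD Gram matrix on random 2-D points
X = np.random.randn(5, 2)
K = gaussian_gram(X, X, sigma=1.0)
```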

Generally, the performance of kernel methods largely depends on the choice of the kernel. Traditional kernel methods often adopt a classical kernel, e.g., the Gaussian kernel or sigmoid kernel, for characterizing the relationship among data points. Empirical studies suggest that these traditional kernels are not sufficiently flexible to depict the domain-specific characteristics of data affinities or relationships. To address this limitation, several routes have been explored. One way is to design sophisticated kernels for specific tasks, such as applying optimal assignment kernels (Kriege et al., 2016) to graph classification, developing a kernel based on triplet comparisons (Kleindessner and von Luxburg, 2017), or even breaking the restriction of positive definiteness on the traditional kernel (Ong et al., 2004; Schleif and Tino, 2015; Loosli et al., 2016; Huang et al., 2017). Apart from these well-designed kernels, a series of research studies aim to automatically learn effective and flexible kernels from data, known as learning kernels. Algorithms for learning kernels can be roughly grouped into two categories: parametric kernel learning and non-parametric kernel learning.

1.1 Review of Kernel Learning

In parametric kernel learning, the learned kernel $k(\cdot,\cdot)$ or the kernel matrix $K = [k(x_i,x_j)]_{n\times n}$ on the training data is assumed to admit a specific parametric form, and the relevant parameters are then learned according to the given data. The earliest work is by Lanckriet et al. (2004), in which an SVM is trained along with optimizing a linear combination of several pre-given positive semi-definite (PSD) matrices $\{K_t\}_{t=1}^s$ subject to a bounded trace constraint, i.e., $K = \sum_{t=1}^s\mu_t K_t$ with $\operatorname{tr}(K)\le c$. Specifically, to ensure that the learned kernel matrix is valid, they impose a PSD constraint on $K$ in two ways: one is directly requiring $K\in S^n_+$, where $S^n_+$ denotes the positive semi-definite cone; the other is considering a nonnegative linear combination, i.e., a nonnegative constraint $\mu_t\ge 0$. In both schemes, the kernel (matrix) learning task is transformed into learning the combination weights. Accordingly, the parameters in SVM and the weights in their model can be learned by solving a semi-definite programming optimization problem in a unified framework.


The above parametric kernel learning framework spawns the field of multiple kernel learning (MKL) (Bach et al., 2004; Varma and Babu, 2009), which aims to learn a good combination of some predefined kernels (or kernel matrices) for rich representations. For example, the weight vector $\mu = [\mu_1, \mu_2, \ldots, \mu_s]^\top$ can be restricted by the conic sum (i.e., $\mu_t\ge 0$), the convex sum (i.e., $\mu_t\ge 0$ and $\sum_{t=1}^s\mu_t = 1$), or various regularizers such as the $\ell_1$ norm, mixed norms, and entropy-based formulations; see the survey (Gonen and Alpaydn, 2011). By doing so, MKL generates a "broader" kernel to enhance the representation ability for the data. Based on the idea of MKL, there are several representative approaches that learn effective kernels by exploring the data information, including: i) hierarchical kernel learning (HKL) (Bach, 2008; Jawanpuria et al., 2015), which learns from a set of base kernels assumed to be embedded on a directed acyclic graph; ii) spectral mixture models (Argyriou et al., 2005; Wilson and Adams, 2013), which aim to learn the spectral density of a kernel in a parametric scheme for discovering flexible statistical representations in data; iii) kernel target alignment (Cristianini et al., 2002; Cortes et al., 2012), which seeks the "best" kernel matrix by maximizing the similarity between $K$ and the ideal kernel $yy^\top$ with the label vector $y$. The ideal kernel can recognize the training data with 100% accuracy, and thus can be used to guide the kernel learning task. Current works on this topic often assume that the learned kernel matrix $K$ takes a parametric form, e.g., an MKL or mixture-of-spectral-densities form; see (Lanckriet et al., 2004; Sinha and Duchi, 2016) and references therein.

Instead of assuming a specific parametric form for the learned kernel, the nonparametric kernel learning framework acquires a positive semi-definite (PSD) kernel matrix in a data-specific manner, which is often more flexible than parametric kernel learning. Typical examples include the following. Lanckriet et al. (2004) consider $K\in S^n_+$ in their optimization problem without extra parametric forms on $K$ in a transductive setting, which results in a nonparametric kernel learning framework. Such a nonparametric kernel learning model is further explored by learning a low-rank kernel matrix (Kulis et al., 2009), imposing pairwise constraints with side/prior information (Hoi et al., 2007; Zhuang et al., 2011), or using local geometry information together with side-information (Lu et al., 2009). Besides, Ong et al. (2005) introduce a hyper reproducing kernel Hilbert space (hyper-RKHS), of which each element is a kernel function. Hence, the kernel can be learned in this space from a broader class without specifying its parametric form, which allows for significant model flexibility. Jain et al. (2012) investigate the equivalence between nonparametric kernel learning and Mahalanobis metric learning, and accordingly propose a nonparametric model seeking a PSD matrix $W$ in a learned kernel $\phi(x)^\top W\phi(x')$ via the LogDet divergence.

1.2 Contributions

From the above discussion, we observe that most studies recognize the significance of directly learning nonparametric kernels in a broad class/space. Nevertheless, it appears intractable to seek a good trade-off between model complexity and flexibility. Furthermore, most nonparametric kernel learning works rely on semi-definite programming to handle the PSD constraint. This scheme leads to a huge time complexity, which makes these approaches infeasible for real-world large-scale applications.

In this paper, we attempt to seek a suitable solving space for nonparametric kernel learning and to scale it to large-scale cases. We propose a Data-Adaptive Nonparametric Kernel (DANK) learning framework, in which a data-adaptive matrix $F\in\mathbb{R}^{n\times n}$ is imposed on a pre-given kernel matrix $K$ in a point-wise strategy, i.e., $F\odot K$, where $\odot$ denotes the Hadamard product between two matrices.
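As a small illustration of this entry-wise strategy (our own sketch; the matrix `F` below is a hypothetical stand-in for a learned adaptive matrix), the adapted Gram matrix is simply the Hadamard product $F\odot K$, which remains PSD whenever both factors are PSD by the Schur Product Theorem cited later in Section 2.

```python
import numpy as np

def adapted_gram(F, K):
    """Entry-wise (Hadamard) adaptation F ⊙ K of a pre-given Gram matrix K."""
    assert F.shape == K.shape
    return F * K  # element-wise product

# toy check: if F and K are both PSD, so is F ⊙ K (Schur product theorem)
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
K = A @ A.T                                                  # a PSD "initial" kernel matrix
F = np.ones((6, 6)) + 0.05 * (A @ A.T) / np.trace(A @ A.T)   # near all-one, PSD
eigs = np.linalg.eigvalsh(adapted_gram(F, K))
assert eigs.min() > -1e-10
```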


Figure 1: Classification boundaries of SVM (blue dashed line) and our DANK model (red solid line) on the clowns dataset: (a) using the Gaussian kernel with $\sigma = 1$; (b) using the Gaussian kernel with $\sigma = 0.07$ obtained by cross validation.

Since we do not specify the formulation of the adaptive matrix $F$, each entry in $F$ can be directly and flexibly learned from the data. This formulation-free strategy contributes to adequate model flexibility. Specifically, the design of $F$ is independent of the pre-given kernel matrix, so we only need to restrict the flexibility of $F$ to control the complexity of the whole model, which makes it possible to seek a good trade-off between model flexibility and complexity. Here we take a synthetic classification dataset, clowns, to illustrate this. The initial kernel matrix $K$ is given by the classical Gaussian kernel $k(x_i,x_j) = \exp(-\|x_i - x_j\|_2^2/2\sigma^2)$ with width $\sigma$. Figure 1 shows that, in the left panel, the baseline SVM with an inappropriate $\sigma = 1$ cannot precisely adapt to the data, resulting in an inaccurate classification boundary. However, based on the same $K$, by optimizing the adaptive matrix $F$, our DANK model shows good flexibility in fitting the complex data distribution, leading to a desirable decision surface. Comparably, in the right panel, even when $\sigma$ is well tuned via cross validation, the baseline SVM fails to capture the local property of the data (see the brown ellipses). Instead, by learning $F$, our DANK model still yields a more accurate boundary than SVM. Regarding model complexity, Torr (2011) indicates that a decision surface that is too flexible tends to overfit the data, so it is better to prohibit the boundary from being arbitrarily flexible locally. We find that our classification boundary is approximately linear in any small region, which effectively controls the model complexity.

The main contributions of this paper are three-fold: i) We propose a data-adaptive nonparametric kernel learning framework termed "DANK" to enhance model flexibility and data adaptivity, which is seamlessly embedded into SVM and SVR for classification and regression tasks; ii) the DANK model and the induced classification/regression model can be formulated as a max-min optimization problem in a unified framework. The related objective function is proven to be gradient-Lipschitz continuous, and thus can be directly solved by a projected gradient method with Nesterov's acceleration; iii) by decomposing the entire optimization problem into several smaller, easy-to-solve subproblems, we propose an accelerated strategy to handle the scalability issue that arises in our nonparametric kernel learning framework. The effectiveness of our decomposition-based scalable approach is demonstrated by both theoretical and empirical studies. The experimental results on several classification and regression benchmark datasets demonstrate the effectiveness of the proposed DANK framework over other representative kernel learning based methods.



This paper is an extension of our previous conference work (Liu et al., 2018). The improvements of this paper are mainly in the following four aspects. First, in model formulation, we impose an additional low-rank constraint and a bounded regularizer on $F$, which makes the original model amenable to out-of-sample extensions. Apart from SVM, in which our DANK model is embedded for classification, the proposed DANK model is also extended to SVR for regression tasks. Second, we develop a Nesterov's smooth method to solve the designed optimization problem, which requires establishing its gradient-Lipschitz continuity. Third, for large-scale situations, we develop a decomposition-based scalable approach for DANK with theoretical guarantees and experimental validation. Lastly, we provide more experimental results on popular benchmarks.

1.3 Notation

We start with definitions of the notation used throughout this paper.

Matrices, vectors and elements: We take $A$ and $\mathbf{a}$ to be a matrix and a vector, whose elements are $A_{ij}$ and $a_i$, respectively. Denote by $I_n$ the $n\times n$ identity matrix, by $\mathbf{0}$ a zero matrix or vector of the appropriate size, and by $\mathbf{1}_n$ the $n$-dimensional vector of all ones.

Sets: The set $\{1, 2, \cdots, n\}$ is written as $[n]$. We call $\{\mathcal{V}_1, \mathcal{V}_2, \cdots, \mathcal{V}_s\}$ an $s$-partition of $[n]$ if $\mathcal{V}_1\cup\cdots\cup\mathcal{V}_s = [n]$ and $\mathcal{V}_p\cap\mathcal{V}_q = \emptyset$ for $p\ne q$. Let $|\mathcal{V}|$ denote the cardinality of the set $\mathcal{V}$. We write $S^n$ for the set of $n\times n$ symmetric matrices and $S^n_+$ for the $n\times n$ PSD cone.

Singular value decomposition (SVD): Let $A\in\mathbb{R}^{n\times d}$ and $r = \operatorname{rank}(A)$. A (compact) singular value decomposition (SVD) is defined as $A = U\Sigma V^\top = \sum_{i=1}^r \sigma_i(A)\,u_i v_i^\top$, where $U$, $\Sigma$, and $V$ are an $n\times r$ column-orthogonal matrix, an $r\times r$ diagonal matrix with diagonal elements $\sigma_i(A)$, and a $d\times r$ column-orthogonal matrix, respectively. If $A$ is PSD, then $U = V$. Accordingly, the singular value soft-thresholding operator is defined as $\mathcal{J}_\tau(A) = U_A\,\mathcal{S}_\tau(\Sigma_A)\,V_A^\top$ with the SVD $A = U_A\Sigma_A V_A^\top$, where the soft-thresholding operator is $\mathcal{S}_\tau(A_{ij}) = \operatorname{sign}(A_{ij})\max(0, |A_{ij}| - \tau)$.

Matrix norms: We use four matrix norms in this paper.
Frobenius norm: $\|A\|_{\mathrm{F}} = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\sum_i\sigma_i^2(A)}$.
Manhattan norm: $\|A\|_{\mathrm{M}} = \sum_{i,j}|A_{ij}|\le c'\sqrt{n}\|A\|_{\mathrm{F}}$, where $c'$ is some constant.
Spectral norm: $\|A\|_2 = \max_{\|x\|_2 = 1}\|Ax\|_2 = \sigma_{\max}(A)$.
Nuclear norm: $\|A\|_* = \sum_i\sigma_i(A)$.
Any square matrix satisfies $\operatorname{tr}(A)\le\|A\|_*$. If $A$ is PSD, then $\operatorname{tr}(A) = \|A\|_*$.

1.4 Paper Organization

In Section 2, we introduce the proposed DANK model embedded in SVM. The model optimization is presented in Section 3: Section 3.1 studies the gradient-Lipschitz continuity and Section 3.2 applies Nesterov's smooth optimization method to solve our model. The scalability of our nonparametric kernel model is addressed in Section 4. Besides, in Section 5, we extend our DANK model to SVR for regression. The experimental results on popular benchmark datasets are presented in Section 6. Section 7 concludes the entire paper. Proofs are provided in the Appendices.


2. The DANK Model in SVM

In this section, we briefly review the SVM formulation and then incorporate the proposed DANK model into SVM for classification. We focus on binary classification problems for ease of description; the model can be extended to multi-class classification tasks.

We begin with the hard-margin SVM for linearly separable binary classification tasks. Denote by $\mathcal{X}\subseteq\mathbb{R}^d$ a compact metric space of features and by $\mathcal{Y} = \{-1, 1\}$ the label space; we assume that a sample set $Z = \{(x_i, y_i)\}_{i=1}^n$ is drawn from a non-degenerate Borel probability measure $\rho$ on $\mathcal{X}\times\mathcal{Y}$. The hard-margin SVM aims to learn a linear classifier $f(x; w, b) = \operatorname{sign}(w^\top x + b)\in\{-1, +1\}$ with $w$ and $b$ determining the decision hyperplane. This is done by maximizing the distance between the nearest training samples of the two classes (a.k.a. the margin $\gamma = 1/\|w\|_2$), as this reduces the model generalization error (Vapnik, 1995). Since the data are not linearly separable in most practical settings, the hard-margin SVM is extended to the soft-margin SVM with an implicit mapping $\phi(\cdot)$ for a non-linear decision hyperplane. Mathematically, the soft-margin SVM aims to maximize the margin $\gamma$ (or minimize $\|w\|_2^2$) and minimize the slack penalty $\sum_{i=1}^n\xi_i$, with the following formulation
$$\begin{aligned}
\min_{w, b, \xi}\ & \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^n\xi_i \\
\text{s.t.}\ & y_i\big(w^\top\phi(x_i) + b\big)\ge 1 - \xi_i,\ \xi_i\ge 0,\ i = 1, 2, \cdots, n,
\end{aligned} \qquad (1)$$

where $\xi = [\xi_1, \xi_2, \cdots, \xi_n]^\top$ is the slack variable and $C$ is the balance parameter. As illustrated by Vapnik (1995), the dual form of problem (1) is given by
$$\begin{aligned}
\max_{\alpha}\ & \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y K Y\alpha \\
\text{s.t.}\ & 0\le\alpha\le C\mathbf{1},\ \alpha^\top y = 0,
\end{aligned} \qquad (2)$$

where $Y = \operatorname{diag}(y)$ is the label matrix and $K = [k(x_i,x_j)]_{n\times n}$ is the Gram matrix satisfying $k(x_i,x_j) = \langle\phi(x_i),\phi(x_j)\rangle_{\mathcal{H}}$. Since problem (2) is convex, strong duality holds by Slater's condition (Boyd and Vandenberghe, 2004), so the optimal values of the primal and dual soft-margin SVM problems are equal. Accordingly, if we learn the kernel matrix in the dual form of SVM in problem (2), the objective function value of the primal problem (1) decreases, which in turn increases the margin $\gamma$. Theoretically, for a fixed kernel, with probability at least $1-\delta$, the estimation error (namely the gap between the empirical error and the expected error) of an SVM classifier with margin $\gamma$ is at most $\sqrt{O(1/\gamma^2 - \log\delta)/n}$ (Koltchinskii et al., 2002). If we consider the learned kernel in some finite/infinite kernel family $\mathcal{K}$, the estimation error can be bounded by $\sqrt{O(\log|\mathcal{K}| + 1/\gamma^2 - \log\delta)/n}$ (Srebro and Ben-David, 2006), where the cardinality $|\mathcal{K}|$ is a certain measure of the complexity of the kernel class. We find that, to achieve a tight bound, a learned SVM classifier should admit a large margin $\gamma$, while the hypothesis space cannot be enlarged arbitrarily. Based on the above analysis, we conduct the nonparametric kernel learning task in our DANK model by introducing an adaptive matrix $F$ into problem (2), that is,

$$\min_{F\in\mathcal{F}}\ \max_{\alpha\in\mathcal{A}}\ \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y(F\odot K)Y\alpha, \qquad (3)$$

where $\mathcal{F}$ is the feasible region of $F\in\mathbb{R}^{n\times n}$ and the constraint set of the standard SVM is

$$\mathcal{A} = \{\alpha\in\mathbb{R}^n : \alpha^\top y = 0,\ 0\le\alpha\le C\mathbf{1}\}.$$


Accordingly, optimizing $F$ yields a flexible kernel matrix $F\odot K$, which is able to increase the margin, and further allows for greater model flexibility with a tighter estimation error as well as a lower generalization error when compared to the original SVM classifier.

In our DANK model, since we do not specify the formulation of $F$, the solution space $\mathcal{F}$ is enlarged to boost the model flexibility and the capacity of fitting diverse data patterns. However, arbitrarily enlarging the solution space $\mathcal{F}$ could bring in two potential issues. One is that a decision boundary that is too flexible tends to overfit the data due to the large capacity of $\mathcal{F}$. The other is significant difficulty in out-of-sample extensions (Bengio et al., 2004; Fanuel et al., 2017; Pan et al., 2017): the flexible kernel matrix $F\odot K$ is learned from the training data in a nonparametric manner, while the adaptive kernel matrix for test data is unknown. Such an out-of-sample extension problem is a common issue in nonparametric kernel learning. To alleviate these two issues, the learned $F$ is expected to vary steadily and smoothly between any two neighboring data points in a suitable compact space, and thus can be easily extended to the adaptive matrix $F'$ for test data. Intuitively speaking, when $F$ is chosen as the all-one matrix, the DANK model in Eq. (3) degenerates to the standard SVM problem. This all-one matrix, together with its rank-one property, guides us to design the suitable space $\mathcal{F}$ in the following two aspects. First, the adaptive matrix $F$ is expected to vary around the all-one matrix in a bounded region, for a trade-off between model complexity and flexibility. Second, $F$ is desired to be endowed with the low-rank structure enjoyed by the all-one matrix. Mathematically, our DANK model is

$$\begin{aligned}
\min_{F\in S^n_+}\ \max_{\alpha\in\mathcal{A}}\ & \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y(F\odot K)Y\alpha \\
\text{s.t.}\ & \|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 \le R^2,\ \operatorname{rank}(F) < r,
\end{aligned} \qquad (4)$$

where $R$ refers to the size of the bounded region, $\operatorname{rank}(F)$ denotes the rank of $F$, and $r\le n$ is a given integer. In our DANK model, the initial kernel matrix $K$ is not limited to the Gaussian kernel matrix and can be chosen as an arbitrary PSD one. The constraint $F\in S^n_+$ is imposed to ensure that the learned kernel matrix $F\odot K$ is still PSD¹. Since we do not specify the parametric form of $F$, we obtain a nonparametric kernel matrix $F\odot K$, and thus our DANK model is nonparametric. The bounded constraint prevents $F$ from dropping to the trivial solution $F = \mathbf{0}_{n\times n}$. Note that, if we attempted to find a new PSD kernel matrix $\widetilde{K}$ in an exact way, e.g., via $\|\widetilde{K} - K\|_{\mathrm{F}}^2$, it would be difficult to directly estimate $\widetilde{K}$ for test data; this is why we split the representation into $F$ separated from $K$. Specifically, due to the non-convexity of the rank constraint in problem (4), we consider the nuclear norm $\|\cdot\|_*$ instead, which is the best convex lower bound of the non-convex rank function (Recht et al., 2010) and can be minimized efficiently. Accordingly, we relax the constrained optimization problem in Eq. (4) to an unconstrained problem by absorbing the two original constraints into the objective function. Moreover, following the min-max approach (Boyd and Vandenberghe, 2004), problem (4) can be reformulated as

$$\max_{\alpha\in\mathcal{A}}\ \min_{F\in S^n_+}\ \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y(F\odot K)Y\alpha + \eta\|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 + \tau\eta\|F\|_*, \qquad (5)$$

where $\eta$ and $\tau$ are two regularization parameters. In problem (5), the inner minimization problem with respect to $F$ is a convex conic program, and the outer maximization problem is a point-wise minimum of concave quadratic functions of $\alpha$.

1. This is admitted by the Schur Product Theorem (Styan, 1973), which relates positive semi-definite matrices to the Hadamard product.


As a consequence, problem (5) is also convex and satisfies strong duality, as discussed above. Here we denote the objective function in Eq. (5) as
$$H(\alpha, F) = \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y(F\odot K)Y\alpha + \eta\|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 + \tau\eta\|F\|_*,$$
of which the optimal solution $(\alpha^*, F^*)$ is a saddle point of $H(\alpha, F)$ due to the max-min structure of problem (5). It is easy to check that $H(\alpha, F^*)\le H(\alpha^*, F^*)\le H(\alpha^*, F)$ for any feasible $\alpha$ and $F$. Further, we define the following function
$$h(\alpha) \triangleq H(\alpha, F^*) = \min_{F\in S^n_+} H(\alpha, F), \qquad (6)$$

which is concave since $h$ is the minimum of a family of concave functions. Specifically, the optimal solution $F^*$ can be restricted to a bounded set $\mathcal{F}$, as demonstrated by the following lemma.

Lemma 1 Given the training data $\{x_i, y_i\}_{i=1}^n$ with labels $y_i\in\{+1,-1\}$ and a pre-given kernel matrix $K$, there exists an equivalence of problem (5) with
$$\max_{\alpha\in\mathcal{A}}\ \min_{F\in S^n_+}\ H(\alpha, F) = \max_{\alpha\in\mathcal{A}}\ \underbrace{\min_{F\in\mathcal{F}}\ H(\alpha, F)}_{\triangleq\, h(\alpha)}, \qquad (7)$$
where the feasible region of $F$ is defined by $\mathcal{F} := \big\{F\in S^n_+ : \lambda_{\max}(F)\le n - \frac{\tau}{2} + \frac{nC^2}{4\eta}\lambda_{\max}(K)\big\}$, a nonempty subset of $S^n_+$, and $\lambda_{\max}(F)$ denotes the largest eigenvalue of $F$.

Proof The key of the proof is to obtain the optimal solution $F^*$ over $S^n_+$, i.e., $F^* = \operatorname*{argmin}_{F\in S^n_+} H(\alpha, F)$ in Eq. (5). By virtue of the following expression²
$$\frac{1}{2}\alpha^\top Y(F\odot K)Y\alpha = 2\operatorname{tr}\big(\eta\,\Gamma(\alpha)F\big) \quad\text{with}\quad \Gamma(\alpha) \triangleq \frac{1}{4\eta}\operatorname{diag}(\alpha^\top Y)\,K\,\operatorname{diag}(\alpha^\top Y), \qquad (8)$$
finding $F^*$ is equivalent to considering the following problem
$$F^* := \operatorname*{argmin}_{F\in S^n_+}\ -2\operatorname{tr}\big(\eta\,\Gamma(\alpha)F\big) + \eta\|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 + \tau\eta\|F\|_*, \qquad (9)$$
where we omit the term $\mathbf{1}^\top\alpha$ in Eq. (5), which is independent of the optimization variable $F$. Further, since $\Gamma(\alpha)$ does not depend on $F$, problem (9) can be reformulated in a compact form as
$$F^* = \operatorname*{argmin}_{F\in S^n_+}\ \|F - \mathbf{1}\mathbf{1}^\top - \Gamma(\alpha)\|_{\mathrm{F}}^2 + \tau\|F\|_*. \qquad (10)$$
Note that the regularization parameter $\eta$ is implicitly included in $\Gamma(\alpha)$. Following (Cai et al., 2010), we can directly obtain the optimal solution of problem (10) as
$$F^* = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\alpha)\big),$$

2. We use the formula $x^\top(A\odot B)y = \operatorname{tr}(D_x A D_y B^\top)$ with $D_x = \operatorname{diag}(x)$ and $D_y = \operatorname{diag}(y)$.


Figure 2: The range of $F^*$ in our DANK model with the Gaussian kernel on the clowns dataset: (a) $\sigma = 1$; (b) $\sigma = 0.1$ by cross validation.

where we use the singular value thresholding operator as the proximity operator associated with the nuclear norm; refer to Theorem 2.1 in (Cai et al., 2010) for details. Having obtained the analytic solution $F^*$, we next seek its upper bound:
$$\begin{aligned}
\lambda_{\max}(F^*) &= \lambda_{\max}\Big(\mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\alpha)\big)\Big) = \lambda_{\max}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\alpha)\big) - \frac{\tau}{2} \\
&\le \lambda_{\max}(\mathbf{1}\mathbf{1}^\top) + \frac{1}{4\eta}\lambda_{\max}\big(\operatorname{diag}(\alpha^\top Y)\,K\,\operatorname{diag}(\alpha^\top Y)\big) - \frac{\tau}{2} \\
&= n + \frac{1}{4\eta}\big\|\operatorname{diag}(\alpha^\top Y)\,K\,\operatorname{diag}(\alpha^\top Y)\big\|_2 - \frac{\tau}{2} \\
&\le n + \frac{1}{4\eta}\big\|\operatorname{diag}(\alpha^\top Y)\big\|_2^2\,\big\|K\big\|_2 - \frac{\tau}{2} \\
&\le n - \frac{\tau}{2} + \frac{nC^2}{4\eta}\lambda_{\max}(K),
\end{aligned} \qquad (11)$$
where the first inequality uses the property of maximum eigenvalues, i.e., $\lambda_{\max}(A + B)\le\lambda_{\max}(A) + \lambda_{\max}(B)$ for any $A, B\in S^n$, and the last inequality holds by $\|\alpha\|_2^2\le nC^2$.

Lemma 1 gives the upper spectral bound of $F^*$, i.e., $F^*$ is bounded in $\mathcal{F}$, which is a subset of the PSD cone $S^n_+$. In this case, we can directly solve for $F$ in the subset $\mathcal{F}\subseteq S^n_+$ instead of the entire PSD cone $S^n_+$. The solution space is thus reduced, which means that we can seek a good trade-off between model flexibility and complexity. Besides, our DANK model requires that the adaptive matrix $F\in\mathcal{F}$ vary around the all-one matrix in a small bounded region. Hence, each element $[\Gamma(\alpha)]_{ij}$ cannot be significantly larger than 1. To this end, recalling Eq. (8) and problem (10), we choose $\eta\in O(n)$ to ensure that $\operatorname{tr}[\Gamma(\alpha)]$ is of the same order as $\operatorname{tr}(\mathbf{1}\mathbf{1}^\top)\in O(n)$.

Here we experimentally show the effectiveness of the introduced constraints on $F$ for a trade-off between model flexibility and complexity. Figure 2 shows the range of the optimal solution $F^*$ generated by our DANK model in Eq. (5) on the clowns dataset with $\sigma = 1$ (left panel) and $\sigma = 0.07$ (right panel), respectively. One can see that, in both cases, the values in $F^*$ range from 0.9 to 1.1, and we obtain $\operatorname{rank}(F^*) = 7$ (left panel) and $\operatorname{rank}(F^*) = 9$ (right panel), respectively. Thereby, such small fluctuation of $F^*$ and its low-rank structure effectively control the model complexity, and thus we can easily extend $F$ to test data.



Regarding the out-of-sample extension issue, given the optimal $F^*$ on the training data, the test data $\{x'_j\}_{j=1}^m$, and the initial kernel matrix for the test data $K' = [k(x_i, x'_j)]_{n\times m}$, we aim to establish the adaptive matrix $F'$ for the test data. To this end, we use a simple but effective technique, the reciprocal nearest neighbor scheme, to acquire $F'$. Formally, we first construct the similarity matrix $M$ between the training data and the test data based on the nearest neighbor scheme. The distance used to find nearest neighbors is the standard $\|x_i - x_j\|_2$ metric in the $d$-dimensional Euclidean space. Assume that $x'_j$ is the $r$-th nearest neighbor of $x_i$ in the set $\{x'_t\}_{t=1}^m$, denoted as $x'_j = \operatorname{NN}_r(x_i, \{x'_t\}_{t=1}^m)$, and meanwhile $x_i$ is the $s$-th nearest neighbor of $x'_j$ in $\{x_t\}_{t=1}^n$, denoted as $x_i = \operatorname{NN}_s(x'_j, \{x_t\}_{t=1}^n)$. Then the similarity matrix $M\in\mathbb{R}^{n\times m}$ is defined by
$$M_{ij} = \frac{1}{rs},\ \text{if}\ x'_j = \operatorname{NN}_r\big(x_i, \{x'_t\}_{t=1}^m\big)\ \wedge\ x_i = \operatorname{NN}_s\big(x'_j, \{x_t\}_{t=1}^n\big),\quad \forall i\in[n],\ j\in[m], \qquad (12)$$
which is a much stronger and more robust indicator of similarity than the simple, unidirectional nearest neighbor relationship, since it takes into account the local densities of vectors around $x_i$ and $x'_j$. The reciprocal nearest neighbor scheme has been extensively applied in computer vision, e.g., image retrieval (Qin et al., 2011) and person re-identification (Zhong et al., 2017; Zheng et al., 2012). Accordingly, $F'$ is given by
$$F'_{:,j} \leftarrow F^*_{:,j^*},\quad \text{if}\ j^* = \operatorname*{argmax}_{i\in[n]}\ \{M_{1j}, M_{2j}, \cdots, M_{ij}, \cdots, M_{nj}\},$$
where $M_{:,j} = [M_{1j}, M_{2j}, \cdots, M_{nj}]^\top\in\mathbb{R}^n$ describes the relationship between $x'_j$ and the training data $\{x_t\}_{t=1}^n$. That is to say, if $x_{j^*}$ is the "optimal" reciprocal nearest neighbor of $x'_j$ among the training data $\{x_t\}_{t=1}^n$, the $j^*$-th column of $F^*$ is assigned to the $j$-th column of $F'$. By doing so, $F^*$ can be directly extended to $F'\in\mathbb{R}^{n\times m}$, resulting in the flexible kernel matrix $F'\odot K'$ for the test data. Admittedly, we would face an inconsistency if we directly extended the training kernel to the test kernel in this way. However, in Eq. (5), $F$ is designed to vary in a small range and is expected to vary smoothly between any two neighboring data points in $\mathcal{F}$, so the extension to $F'$ on test data by this scheme is reasonable. Further, we provide some theoretical justification for this scheme as follows.
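For concreteness, the reciprocal nearest neighbor extension can be sketched as follows (an illustrative NumPy version of the rule above, not the authors' implementation); it scores every training/test pair by $1/(rs)$ using the mutual neighbor ranks and copies the best-scoring column of $F^*$.

```python
import numpy as np

def reciprocal_nn_extension(F_star, X_train, X_test):
    """Extend the learned adaptive matrix F* (n x n) to F' (n x m) via the
    reciprocal nearest neighbor rule M_ij = 1 / (r * s)."""
    n, m = X_train.shape[0], X_test.shape[0]
    D = np.linalg.norm(X_train[:, None, :] - X_test[None, :, :], axis=2)  # n x m distances
    # r: rank of test point j among the neighbors of training point i
    # s: rank of training point i among the neighbors of test point j
    r = np.argsort(np.argsort(D, axis=1), axis=1) + 1
    s = np.argsort(np.argsort(D, axis=0), axis=0) + 1
    M = 1.0 / (r * s)                                   # n x m similarity scores
    F_prime = np.empty((n, m))
    for j in range(m):
        j_star = np.argmax(M[:, j])                     # "optimal" reciprocal neighbor
        F_prime[:, j] = F_star[:, j_star]               # copy the j*-th column of F*
    return F_prime
```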

Mathematically, if $x$ and $x'$ are a reciprocal nearest neighbor pair, we expect $\|\varphi^\top(x)\varphi(x'_j) - \varphi^\top(x')\varphi(x'_j)\|_2^2$ to be small on the test data $\{x'_j\}_{j=1}^m$, where $\varphi$ is the learned implicit feature mapping such that $F_{ij}K_{ij} = \langle\varphi(x_i),\varphi(x_j)\rangle_{\mathcal{H}}$ on the training data. Balcan et al. (2006) demonstrate that, in the presence of a large margin $\gamma$, a kernel function can also be viewed as a mapping from the input space $\mathcal{X}$ into an $O(1/\gamma^2)$-dimensional space.

Proposition 2 (Balcan et al., 2006) Given $0 < \epsilon\le 1$, the margin $\gamma$ and the implicit mapping $\varphi(\cdot)$ in SVM, then with probability at least $1-\delta$, let $d'$ be a positive integer such that $d'\ge d_0 = O(\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta})$; for any $x\in\{x_i\}_{i=1}^n\subset\mathbb{R}^d$ and $x'\in\{x'_j\}_{j=1}^m\subset\mathbb{R}^d$ with $d\ge d'$, we have
$$(1-\epsilon)\|x - x'\|_2^2 \le \|\varphi(x) - \varphi(x')\|_2^2 \le (1+\epsilon)\|x - x'\|_2^2,$$
where the mapping $\varphi$ is a random projection following the Gaussian distribution or the uniform distribution.


Remark: When $K$ is a Gaussian kernel matrix and $F$ is bounded, the learned kernel associated with $F\odot K$ is sub-Gaussian, so the above bounds can still be achieved for the sub-Gaussian distribution with a tail bound condition (Shi et al., 2012). Regarding $d'$, we choose the lower bound $d = d' = d_0 = O(\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta})$, and thus we have $d = O(\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta})$. As the margin is $\gamma = \frac{1}{\|w\|_2}$ and $\|w\|_2^2\in O(d)$ for $w\in\mathbb{R}^d$, we have $O(\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta}) = O(d)$, i.e., of the same order as $d$. Hence the required condition $d\ge d'$ is satisfied.

Based on the above analysis, given the "optimal" reciprocal nearest neighbor pair $(x, x')$, for any test data point $x'_j$ with $j\in[m]$, we have
$$\begin{aligned}
\|\varphi^\top(x)\varphi(x'_j) - \varphi^\top(x')\varphi(x'_j)\|_2^2 &\le \|\varphi(x) - \varphi(x')\|_2^2\,\|\varphi^\top(x'_j)\varphi(x'_j)\|_2^2 \\
&\le \underbrace{\|\varphi^\top(x'_j)\varphi(x'_j)\|_2^2}_{\text{affected by}\ F\odot K\ \text{for}\ x'_j}\,(1+\epsilon)\,\underbrace{\|x - x'\|_2^2}_{\text{small}} \\
&\le C(1+\epsilon)\|x - x'\|_2^2,
\end{aligned} \qquad (13)$$
where $C$ is a constant, since the learned kernel is bounded. This is because, first, by controlling the complexity of the solution space $\mathcal{F}$, the learned $F$, whose values are around 1, is bounded; and second, the initial kernel function is often assumed to be bounded, i.e., $\sup_{x\in\mathcal{X}} k(x,x) < \infty$ (Bach, 2017). Hence the learned kernel is bounded, and we can easily extend it to a new data point $x'_j$ without divergence. As a result, we obtain a small $\|\varphi^\top(x)\varphi(x'_j) - \varphi^\top(x')\varphi(x'_j)\|_2^2$ whenever $\|x - x'\|_2^2$ is small, which is beneficial for out-of-sample extensions. In Section 6.2.4, we experimentally evaluate the performance of different out-of-sample extension based algorithms to validate the justification of our reciprocal nearest neighbor scheme.

To sum up, we formulate the proposed DANK model embedded in SVM as a max-min optimization problem, jointly learning the adaptive matrix $F$ and the standard SVM parameters in problem (5). In the next section, we develop a smooth optimization algorithm to directly solve the DANK model in SVM.

3. Algorithm for DANK model in SVM

Extra-gradient based methods can be directly applied to solve the max-min problem (5) (i.e., a convex-concave min-max problem) and have been shown to exhibit an $O(1/t)$ convergence rate (Nemirovski, 2004), where $t$ is the iteration number. To accelerate convergence, this section investigates the gradient-Lipschitz continuity of $h(\alpha)$ in Eq. (6). Based on this, we introduce Nesterov's smooth optimization method (Nesterov, 2005), which requires $\nabla h(\alpha)$ to be Lipschitz continuous, to solve problem (5); it achieves an $O(1/t^2)$ convergence rate.

3.1 Gradient-Lipschitz continuity of DANK in SVM

To prove the gradient-Lipschitz continuity of h(α), we need the following lemma.

Lemma 3 For any $\alpha_1, \alpha_2\in\mathcal{A}$, we have
$$\big\|F(\alpha_1) - F(\alpha_2)\big\|_{\mathrm{F}} \le \frac{\|K\|\big(\|\alpha_1\| + \|\alpha_2\|\big)}{4\eta}\,\big\|\alpha_1 - \alpha_2\big\|_2,$$
where $F(\alpha_1) = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\alpha_1)\big)$ and $F(\alpha_2) = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\alpha_2)\big)$.


Algorithm 1: Projected gradient method with Nesterov's acceleration for problem (5)
Input: The kernel matrix $K$, the label matrix $Y$, and the Lipschitz constant $L = n + \frac{3nC^2\|K\|_{\mathrm{F}}^2}{4\eta}$
Output: The optimal $\alpha^*$
1  Set the stopping criteria $t_{\max} = 2000$ and $\varepsilon = 10^{-4}$.
2  Initialize $t = 0$ and $\alpha^{(0)} := \mathbf{0}\in\mathcal{A}$.
3  Repeat
4    Compute $F(\alpha^{(t)}) = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\alpha^{(t)})\big)$;
5    Compute $\nabla h(\alpha^{(t)}) = \mathbf{1} - Y\big(F(\alpha^{(t)})\odot K\big)Y\alpha^{(t)}$;
6    Compute $\theta^{(t)} = \mathcal{P}_{\mathcal{A}}\big(\alpha^{(t)} + \frac{1}{L}\nabla h(\alpha^{(t)})\big)$;
7    Compute $\beta^{(t)} = \mathcal{P}_{\mathcal{A}}\big(\alpha^{(0)} - \frac{1}{2L}\sum_{i=0}^t (i+1)\nabla h(\alpha^{(i)})\big)$;
8    Set $\alpha^{(t+1)} = \frac{t+1}{t+3}\theta^{(t)} + \frac{2}{t+3}\beta^{(t)}$;
9    $t := t+1$;
10 Until $t\ge t_{\max}$ or $\|\alpha^{(t)} - \alpha^{(t-1)}\|_2\le\varepsilon$;
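For concreteness, here is a minimal NumPy sketch of Algorithm 1 (our own illustrative reading of the pseudocode, not the authors' released implementation); the helper `project_A` is the alternating projection onto $\mathcal{A}$ described in Section 3.2 and sketched separately there.

```python
import numpy as np

def svt(A, tau):
    """Singular value soft-thresholding J_tau(A) (see the notation section)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def dank_svm_nesterov(K, y, project_A, C=1.0, tau=0.01, eta=None,
                      t_max=2000, tol=1e-4):
    """Projected gradient ascent with Nesterov's acceleration for problem (5)."""
    n = K.shape[0]
    eta = float(n) if eta is None else eta                  # eta in O(n), as suggested
    L = n + 3.0 * n * C ** 2 * np.linalg.norm(K, "fro") ** 2 / (4.0 * eta)
    Y = np.diag(y)
    ones = np.ones((n, n))
    alpha = np.zeros(n)
    alpha0 = alpha.copy()
    grad_sum = np.zeros(n)
    F = ones.copy()
    for t in range(t_max):
        d = alpha * y                                        # entries of diag(alpha^T Y)
        F = svt(ones + np.outer(d, d) * K / (4.0 * eta), tau / 2.0)     # Line 4
        grad = 1.0 - Y @ (F * K) @ (Y @ alpha)               # Line 5: gradient of h(alpha)
        theta = project_A(alpha + grad / L)                  # Line 6
        grad_sum += (t + 1) * grad
        beta = project_A(alpha0 - grad_sum / (2.0 * L))      # Line 7
        alpha_new = (t + 1) / (t + 3) * theta + 2.0 / (t + 3) * beta    # Line 8
        if np.linalg.norm(alpha_new - alpha) <= tol:         # Line 10
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha, F
```

Under these assumptions, the returned pair approximates the saddle point $(\alpha^*, F^*)$ of $H(\alpha, F)$.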

Proof The proofs can be found in Appendix A.1.

Formally, based on Lemmas 1 and 3, we present the following theorem.

Theorem 4 The function $h(\alpha)$ is gradient-Lipschitz continuous with Lipschitz constant $L = n + \frac{3nC^2\|K\|_{\mathrm{F}}^2}{4\eta}$, i.e., for any $\alpha_1, \alpha_2\in\mathcal{A}$, the inequality $\|\nabla h(\alpha_1) - \nabla h(\alpha_2)\|_2\le L\|\alpha_1 - \alpha_2\|_2$ holds.

Proof The proofs can be found in Appendix A.2.

The above theoretical analysis demonstrates that $\nabla h(\alpha)$ is Lipschitz continuous, which justifies utilizing Nesterov's smooth optimization method to solve problem (5) with faster convergence.

3.2 Nesterov’s Smooth Optimization Method

Here we introduce a projected gradient algorithm with Nesterov's acceleration to solve the optimization problem (5). Nesterov (2005) proposes an optimal scheme for smooth optimization $\min_{x\in Q} g(x)$, where $g$ is a convex gradient-Lipschitz continuous function over a closed convex set $Q$. Introducing a continuous and strongly convex proxy-function $d(x)$ on $Q$, the first-order projected gradient method with Nesterov's acceleration can then be used to solve this problem. In our model, we aim to solve the following convex problem
$$\max_{\alpha\in\mathcal{A}}\ h(\alpha), \qquad (14)$$
where $h(\alpha)$ is concave and gradient-Lipschitz continuous with the Lipschitz constant $L = n + \frac{3nC^2\|K\|_{\mathrm{F}}^2}{4\eta}$ as demonstrated in Theorem 4. Here the proxy-function is defined as $d(\alpha) = \frac{1}{2}\|\alpha - \alpha_0\|_2^2$ with $\alpha_0\in\mathcal{A}$. The first-order Nesterov's smooth optimization method for solving problem (5) is summarized in Algorithm 1.


The key steps of Nesterov's acceleration are characterized by Lines 6, 7, and 8 in Algorithm 1. To be specific, according to Nesterov (2005), at the $t$-th iteration we need to solve the following problem
$$\beta^{(t)} = \operatorname*{argmin}_{\alpha\in\mathcal{A}}\ \frac{L}{2}\|\alpha - \alpha_0\|_2^2 + \sum_{i=0}^t \frac{i+1}{2}\Big[h(\alpha^{(i)}) + \big\langle\nabla h(\alpha^{(i)}),\ \alpha - \alpha^{(i)}\big\rangle\Big], \qquad (15)$$
which is equivalent to
$$\beta^{(t)} = \operatorname*{argmin}_{\alpha\in\mathcal{A}}\ \|\alpha - \alpha_0\|_2^2 + \frac{2}{L}\sum_{i=0}^t \frac{i+1}{2}\nabla h^\top(\alpha^{(i)})\,\alpha,$$
where we omit the terms $h(\alpha^{(i)})$ and $\nabla h^\top(\alpha^{(i)})\alpha^{(i)}$ in Eq. (15) that are independent of the optimization variable $\alpha$. Accordingly, the above problem can be further reformulated as
$$\beta^{(t)} = \operatorname*{argmin}_{\alpha\in\mathcal{A}}\ \Big\|\alpha - \alpha_0 + \frac{1}{2L}\sum_{i=0}^t (i+1)\nabla h(\alpha^{(i)})\Big\|_2^2,$$
of which the optimal solution is $\beta^{(t)} = \mathcal{P}_{\mathcal{A}}\big(\alpha_0 - \frac{1}{2L}\sum_{i=0}^t (i+1)\nabla h(\alpha^{(i)})\big)$, as outlined in Line 7 of Algorithm 1, where $\mathcal{P}_{\mathcal{A}}(\alpha)$ is the projection operator that projects $\alpha$ onto the set $\mathcal{A}$. A quick note on the projection onto the feasible set $\mathcal{A} = \{\alpha\in\mathbb{R}^n : \alpha^\top y = 0,\ 0\le\alpha\le C\mathbf{1}\}$: in practice it suffices to use the alternating projection algorithm (Von Neumann, 1949). Since the feasible set $\mathcal{A}$ is the intersection of a hyperplane and a hypercube, both admit a simple projection step. To be specific, we first clip $\alpha$ to $[0, C]$ and then project onto the hyperplane via $\alpha\leftarrow\alpha - \frac{y^\top\alpha}{n}y$. We use 10 such alternating projections to obtain the projected $\alpha$. The convergence rate of the alternating projection algorithm is known to be linear (Von Neumann, 1949), and thus it is very efficient.
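A minimal sketch of this alternating projection (assuming labels $y_i\in\{+1,-1\}$, so that $\|y\|_2^2 = n$ and the hyperplane step takes the simple form above):

```python
import numpy as np

def project_onto_A(alpha, y, C, n_iter=10):
    """Alternating projection onto A = {alpha : alpha^T y = 0, 0 <= alpha <= C}.

    Each sweep clips alpha to the box [0, C] and then projects onto the
    hyperplane alpha^T y = 0 (note ||y||^2 = n since y is +1/-1 valued).
    """
    n = y.shape[0]
    for _ in range(n_iter):
        alpha = np.clip(alpha, 0.0, C)          # box projection
        alpha = alpha - (y @ alpha / n) * y     # hyperplane projection
    return alpha
```

In the sketch of Algorithm 1 given earlier, this routine would be supplied as `project_A=lambda a: project_onto_A(a, y, C)`.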

Note that when Lines 6, 7, and 8 in Algorithm 1 are replaced by
$$\alpha^{(t+1)} = \mathcal{P}_{\mathcal{A}}\Big(\alpha^{(t)} + \frac{1}{L}\nabla h(\alpha^{(t)})\Big) \qquad (16)$$
with the Lipschitz constant $L = n - \frac{\tau}{2} + \frac{nC^2}{4\eta}\lambda_{\max}(K)$ derived from Lemma 1, the Nesterov's smooth method degenerates to a standard projected gradient method.

The convergence of the Nesterov smoothing optimization algorithm is given by Theorem 2 in (Nesterov, 2005), namely
$$h(\alpha^*) - h(\beta^{(t)}) \le \frac{8L\|\alpha_0 - \alpha^*\|^2}{(t+1)(t+2)},$$
where $\alpha^*$ is the optimal solution of Eq. (14). Note that, in general, Algorithm 1 cannot guarantee the sequences $\{h(\alpha^{(t)}) : t\in\mathbb{N}\}$ and $\{h(\beta^{(t)}) : t\in\mathbb{N}\}$ to be monotonically increasing during the maximization process. Nevertheless, the algorithm can be modified to obtain a monotone sequence by replacing Line 6 in Algorithm 1 with
$$\tilde\theta^{(t)} = \mathcal{P}_{\mathcal{A}}\Big(\alpha^{(t)} + \frac{1}{L}\nabla h(\alpha^{(t)})\Big), \qquad \theta^{(t)} = \operatorname*{argmax}_{\alpha\in\{\theta^{(t-1)},\,\tilde\theta^{(t)},\,\alpha^{(t)}\}} h(\alpha).$$


The Nesterov's smooth optimization method takes $O(\sqrt{L/\varepsilon})$ iterations to find an $\varepsilon$-optimal solution, which is better than the standard projected gradient method with complexity $O(L/\varepsilon)$.

4. DANK in Large Scale Case

Scalability is a vital issue for kernel methods, which often limits their application to large datasets (Maji et al., 2013; Rahimi and Recht, 2007; Wang et al., 2016), especially for nonparametric kernel learning based on semi-definite programming. Hence, in this section, we investigate kernel approximation in nonparametric kernel learning, and take our DANK model embedded in SVM as an example to illustrate this scheme. The theoretical results presented in this section are also applicable to other nonparametric kernel learning based algorithms.

To make our DANK model embedded in SVM scalable to large-scale situations, problem (5) is reformulated as
$$\max_{\alpha}\ \min_{F\in S^n_+}\ H(\alpha, F) = \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y(F\odot K)Y\alpha + \eta\|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 \quad \text{s.t.}\ 0\le\alpha\le C\mathbf{1}, \qquad (17)$$
where the bias term $b$ is usually omitted in the large-scale setting (Keerthi et al., 2006; Hsieh et al., 2014; Lian and Fan, 2017), and we also omit the low-rank regularizer on $F$ due to its inseparable property, which is reasonable given the rapidly decaying spectra of the kernel matrix (Smola and Scholkopf, 2000). In Section 6.2.2, we will verify that this low-rank term can be dropped without sacrificing too much performance in large-scale situations.

In our decomposition-based scalable approach, we divide the data into small subsets by k-means clustering, and then solve each subset independently and efficiently. A similar idea also appears in (Hsieh et al., 2014; Zhang et al., 2013; Si et al., 2017). To be specific, we first partition the data into $v$ subsets $\{\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_v\}$, and then solve the respective subproblems independently with the following formulation
$$\begin{aligned}
\max_{\alpha^{(c)}}\ \min_{F^{(c,c)}\in S^{|\mathcal{V}_c|}_+}\ & \mathbf{1}^\top\alpha^{(c)} + \eta\|F^{(c,c)} - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 - \frac{1}{2}\alpha^{(c)\top} Y^{(c,c)}\big(F^{(c,c)}\odot K^{(c,c)}\big)Y^{(c,c)}\alpha^{(c)} \\
\text{s.t.}\ & 0\le\alpha^{(c)}\le C\mathbf{1},\ \forall\, c = 1, 2, \ldots, v,
\end{aligned} \qquad (18)$$
where $|\mathcal{V}_c|$ denotes the number of data points in $\mathcal{V}_c$. Suppose that $(\bar\alpha^{(c)}, \bar F^{(c,c)})$ is the optimal solution of the $c$-th subproblem; the approximate solution $(\bar\alpha, \bar F)$ to the whole problem is then assembled as $\bar\alpha = [\bar\alpha^{(1)}, \bar\alpha^{(2)}, \ldots, \bar\alpha^{(v)}]$ and $\bar F = \operatorname{diag}(\bar F^{(1,1)}, \bar F^{(2,2)}, \ldots, \bar F^{(v,v)})$, where $\bar F$ is a block-diagonal matrix.
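A sketch of this block-wise assembly (illustrative only: `solve_subproblem` stands for Algorithm 1 run on one cluster, and `labels` are cluster indices, e.g., from k-means):

```python
import numpy as np

def dank_scalable(K, y, labels, solve_subproblem):
    """Solve problem (18) cluster by cluster and assemble the global solution.

    labels: cluster index pi(x_i) for every training point.
    solve_subproblem: callable (K_cc, y_c) -> (alpha_c, F_cc), e.g. Algorithm 1
                      restricted to one cluster.
    """
    n = K.shape[0]
    alpha_bar = np.zeros(n)
    F_bar = np.zeros((n, n))                     # block-diagonal adaptive matrix
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        alpha_c, F_cc = solve_subproblem(K[np.ix_(idx, idx)], y[idx])
        alpha_bar[idx] = alpha_c
        F_bar[np.ix_(idx, idx)] = F_cc
    return alpha_bar, F_bar
```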

The above kernel approximation scheme makes the nonparametric kernel learning framework feasible in large-scale situations. In the following, we theoretically analyze the decomposition-based scalable approach in three respects. First, the objective function value $H(\bar\alpha, \bar F)$ in Eq. (17) is close to $H(\alpha^*, F^*)$. Second, the approximate solution $(\bar\alpha, \bar F)$ is close to the optimal solution $(\alpha^*, F^*)$. Third, if $x_i$ is not a support vector of a subproblem, it will also not be a support vector of the whole problem under some conditions. To prove these three propositions, we need the following lemma that links the subproblems to the whole problem.


Lemma 5 $(\bar\alpha, \bar F)$ is the optimal solution of the following problem
$$\max_{\alpha}\ \min_{F\in S^n_+}\ \bar H(\alpha, F) \triangleq \mathbf{1}^\top\alpha - \frac{1}{2}\alpha^\top Y(F\odot\bar K)Y\alpha + \eta\|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 \quad \text{s.t.}\ 0\le\alpha\le C\mathbf{1}, \qquad (19)$$
with the kernel $\bar K$ defined by
$$\bar K_{ij} = I\big(\pi(x_i), \pi(x_j)\big)\,K_{ij},$$
where $\pi(x_i)$ is the cluster that $x_i$ belongs to, and $I(a, b) = 1$ iff $a = b$, and $I(a, b) = 0$ otherwise.

Proof The proofs can be found in Appendix B.1.
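In code, the surrogate kernel $\bar K$ of Lemma 5 is simply the kernel matrix masked by cluster membership; an illustrative sketch:

```python
import numpy as np

def masked_kernel(K, labels):
    """K_bar[i, j] = K[i, j] if pi(x_i) == pi(x_j), and 0 otherwise."""
    same_cluster = labels[:, None] == labels[None, :]
    return np.where(same_cluster, K, 0.0)
```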

Based on the above lemma, we are ready to investigate the difference between $H(\alpha^*, F^*)$ and $H(\bar\alpha, \bar F)$ as follows.

Theorem 6 Given the training data $\{x_i, y_i\}_{i=1}^n$ with labels $y_i\in\{+1,-1\}$ and a partition indicator $\{\pi(x_1), \pi(x_2), \cdots, \pi(x_n)\}$, denote by $(\alpha^*, F^*)$ and $(\bar\alpha, \bar F)$ the optimal solutions of problem (5) and problem (19), respectively. Suppose that each element of $F^*$ and $\bar F$ satisfies $0 < B_1\le\max\{F^*_{ij}, \bar F_{ij}\}\le B_2$, and let $B = B_2 - B_1$; then we have
$$\big|H(\alpha^*, F^*) - H(\bar\alpha, \bar F)\big| \le \frac{1}{2}BC^2 Q(\pi),$$
with $Q(\pi) = \sum^n_{i,j:\,\pi(x_i)\ne\pi(x_j)} |k(x_i, x_j)|$ and the balance parameter $C$ in SVM.

Proof The proofs can be found in Appendix B.2.

Remark: $Q(\pi)$ actually consists of the off-diagonal-block values of the kernel matrix $K$ if we rearrange the training data in a clustering order. It depends on the data distribution, the number of clusters $v$, and the kernel type. Intuitively, if the clusters are nicely shaped (e.g., Gaussian) and well-separated, the kernel matrix is approximately block-diagonal, and $Q(\pi)$ is small. Let us examine two extreme cases of $v$. If $v = 1$, i.e., only one cluster, we have $Q(\pi) = 0$. If $v = n$, i.e., each data point forms its own cluster, then $Q(\pi)$ can be upper bounded by $Q(\pi)\le\sum^n_{i,j}|k(x_i, x_j)| := \|K\|_{\mathrm{M}}\le c'\sqrt{n}\|K\|_{\mathrm{F}}$, where $c'$ is some constant. In practical clustering algorithms, $v$ is often chosen to be much smaller than $n$, i.e., $v\ll n$, and thus we can obtain a small $Q(\pi)$.
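For reference, $Q(\pi)$ can be computed directly from the kernel matrix and the cluster labels (illustrative sketch):

```python
import numpy as np

def Q_of_partition(K, labels):
    """Q(pi) = sum of |K[i, j]| over all pairs lying in different clusters."""
    different_cluster = labels[:, None] != labels[None, :]
    return np.abs(K)[different_cluster].sum()
```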

Next, we investigate the approximation error between the approximate solution $(\bar\alpha, \bar F)$ and the optimal solution $(\alpha^*, F^*)$ in the following theorem.

Theorem 7 Under the same conditions as Theorem 6, the error between $(\bar\alpha, \bar F)$ and $(\alpha^*, F^*)$ can be bounded by
$$\begin{aligned}
\|\alpha^* - \bar\alpha\|_2^2 &\le \frac{B_2 C^2 Q(\pi)}{B_1\|K\|_{\mathrm{F}}} + \frac{2BC^2}{B_1} := O(\sqrt{n}), \\
\|F^* - \bar F\|_{\mathrm{F}} &\le \frac{\|K\|_{\mathrm{F}}}{2\eta}\sqrt{\frac{nB_2 Q(\pi)}{B_1\|K\|_{\mathrm{F}}} + \frac{2nB}{B_1}C^2} + \frac{C^2 Q(\pi)}{4\eta} := O(n^{\frac{3}{4}}).
\end{aligned} \qquad (20)$$


Proof The proofs can be found in Appendix B.3.

Remark: The approximation error bound on $\bar\alpha$ in Theorem 7 improves upon the exact bound $O(n)$ to $O(\sqrt{n})$ due to $Q(\pi)\le c'\sqrt{n}\|K\|_{\mathrm{F}}$. Compared to the intuitive bound $\|F^* - \bar F\|_{\mathrm{F}}\le n\max\{\sqrt{B}, B\}\in O(n)$, the error bound on $\bar F$ in Theorem 7 is improved to $O(n^{\frac{3}{4}})$ due to $\|K\|_{\mathrm{F}}\in O(n)$ and $\eta\in O(n)$. In Section 6.2.3, we experimentally check whether the derived bounds are tight.

The above theoretical results demonstrate the approximation performance of the subproblems with respect to the whole problem, in terms of the difference of their respective objective function values (Theorem 6) and the difference of their respective optimization variables (Theorem 7). Besides, in SVM we are also concerned with the relationship of support/non-support vectors between the subproblems and the whole problem. Accordingly, we present the following theorem to address this issue.

Theorem 8 Under the same conditions as Theorem 6, with the additional boundedness assumption $\sup_{x,x'\in\mathcal{X}}|k(x, x')| < \kappa$, suppose that $x_i$ is not a support vector of its subproblem, i.e., $\bar\alpha_i = 0$. Then $x_i$ will also not be a support vector of the whole problem, i.e., $\alpha^*_i = 0$, under the following condition
$$\big(\nabla_\alpha H(\bar\alpha, \bar F)\big)_i \le -(B + B_2)C\big(\|K_i\|_1 + \kappa\big) \le -(B + B_2)C\big(\|K\|_1 + \kappa\big), \qquad (21)$$
where $K_i$ denotes the $i$-th column of the kernel matrix $K$.

Proof The proofs can be found in Appendix B.4.

Remark: Eq. (21) is a sufficient condition and can be expressed as
$$O(n) =: 1 - nB_2\kappa C \le \big(\nabla_\alpha H(\bar\alpha, \bar F)\big)_i \le -(B + B_2)C\big(\|K_i\|_1 + \kappa\big) := O(\sqrt{n}),$$
where the first inequality holds since $\inf\big(\nabla_\alpha H(\bar\alpha, \bar F)\big)_i = \inf\big(1 - \sum_{j=1}^n y_i y_j\bar F_{ij}K_{ij}\bar\alpha_j\big) = 1 - nB_2\kappa C$. Hence, if we assume that $\big(\nabla_\alpha H(\bar\alpha, \bar F)\big)_i$ of the non-support vectors is uniformly distributed over the range $[1 - n\kappa B_2 C,\, 0]$, then nearly $1 - \frac{(B + B_2)C(\|K_i\|_1 + \kappa)}{nB_2\kappa C - 1}\approx 1 - \frac{c}{\sqrt{n}}$ of the total non-support vectors can be directly recognized, where $c$ is some constant. Hence, the screening proportion of non-support vectors is $(1 - \frac{c}{\sqrt{n}})\times 100\%$, increasing at a certain $O(n^{-1/2})$ rate. If $\big(\nabla_\alpha H(\bar\alpha, \bar F)\big)_i$ follows some heavy-tailed distribution over the range $[1 - n\kappa B_2 C,\, 0]$, the recognized rate would decrease. In Section 6.2.3, we experimentally check that our screening condition is reasonable.
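To illustrate how such a screening rule would be applied, the sketch below flags the coordinates satisfying the sufficient condition (21); note that $B$, $B_2$ and $\kappa$ are analysis constants that would have to be estimated in practice (e.g., from the observed range of $\bar F$ and the kernel bound), so this is a rough illustration rather than the authors' procedure.

```python
import numpy as np

def screen_non_support_vectors(K, y, F_bar, alpha_bar, C, B, B2, kappa):
    """Flag coordinates satisfying condition (21), i.e. points guaranteed to
    remain non-support vectors of the whole problem."""
    Y = np.diag(y)
    grad = 1.0 - Y @ (F_bar * K) @ (Y @ alpha_bar)        # (grad_alpha H(alpha_bar, F_bar))_i
    col_l1 = np.abs(K).sum(axis=0)                        # ||K_i||_1 per column
    thresholds = -(B + B2) * C * (col_l1 + kappa)
    return (alpha_bar == 0) & (grad <= thresholds)
```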

5. DANK Model in SVR

In this section, we incorporate the DANK model into SVR and develop the corresponding Nesterov's smooth optimization algorithm to solve it. Here the label space is $\mathcal{Y}\subseteq\mathbb{R}$ for regression.

Similar to DANK embedded in SVM in problem (5), we incorporate the DANK model into SVR with the $\varepsilon$-insensitive loss, namely
$$\begin{aligned}
\max_{\hat\alpha,\check\alpha}\ \min_{F\in S^n_+}\ & -\frac{1}{2}(\hat\alpha - \check\alpha)^\top(F\odot K)(\hat\alpha - \check\alpha) + (\hat\alpha - \check\alpha)^\top y - \varepsilon(\hat\alpha + \check\alpha)^\top\mathbf{1} + \eta\|F - \mathbf{1}\mathbf{1}^\top\|_{\mathrm{F}}^2 + \tau\eta\|F\|_* \\
\text{s.t.}\ & 0\le\hat\alpha,\check\alpha\le C\mathbf{1},\ (\hat\alpha - \check\alpha)^\top y = 0,
\end{aligned} \qquad (22)$$


where $\hat\alpha$ and $\check\alpha$ are the dual variables, and the combined dual variable is $\alpha = \hat\alpha - \check\alpha$. The objective function in problem (22) is denoted as $H(\hat\alpha, \check\alpha, F)$. Further, we define the following function
$$h(\hat\alpha,\check\alpha) \triangleq H(\hat\alpha,\check\alpha,F^*) = \min_{F\in S^n_+} H(\hat\alpha,\check\alpha,F), \qquad (23)$$
where $h(\hat\alpha,\check\alpha)$ can be obtained by solving the following problem
$$\min_{F\in S^n_+}\ \|F - \mathbf{1}\mathbf{1}^\top - \Gamma(\hat\alpha,\check\alpha)\|_{\mathrm{F}}^2 + \tau\|F\|_*, \qquad (24)$$
with $\Gamma(\hat\alpha,\check\alpha) = \frac{1}{4\eta}\operatorname{diag}(\hat\alpha - \check\alpha)\,K\,\operatorname{diag}(\hat\alpha - \check\alpha)$. The optimal solution of Eq. (24) is $F^* = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\hat\alpha,\check\alpha)\big)$. We can easily check that Lemma 1 is also applicable to problem (23), i.e.,
$$h(\hat\alpha,\check\alpha) = \min_{F\in\mathcal{B}} H(\hat\alpha,\check\alpha,F),$$
where $\mathcal{B}$ is the corresponding bounded subset of $S^n_+$.

Similar to Lemma 3, in our DANK model embedded in SVR, $\|\Gamma(\hat\alpha_1,\check\alpha_1) - \Gamma(\hat\alpha_2,\check\alpha_2)\|_2$ can be bounded by the following lemma.

Lemma 9 For any $\hat\alpha_1, \check\alpha_1, \hat\alpha_2, \check\alpha_2\in\mathcal{A}$, we have
$$\big\|F(\hat\alpha_1,\check\alpha_1) - F(\hat\alpha_2,\check\alpha_2)\big\|_{\mathrm{F}} \le \big\|\Gamma(\hat\alpha_1,\check\alpha_1) - \Gamma(\hat\alpha_2,\check\alpha_2)\big\|_{\mathrm{F}} \le \frac{\|K\|}{4\eta}\,\big\|\hat\alpha_1 - \check\alpha_1 + \hat\alpha_2 - \check\alpha_2\big\|_2\,\big\|\hat\alpha_1 - \check\alpha_1 - \hat\alpha_2 + \check\alpha_2\big\|_2,$$
where $F(\hat\alpha_1,\check\alpha_1) = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\hat\alpha_1,\check\alpha_1)\big)$ and $F(\hat\alpha_2,\check\alpha_2) = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\hat\alpha_2,\check\alpha_2)\big)$.

The proof of Lemma 9 is similar to that of Lemma 3, so we omit the details. Next, we present the partial derivatives of $h(\hat\alpha,\check\alpha)$ with respect to $\hat\alpha$ and $\check\alpha$.

Proposition 10 The objective function $h(\hat\alpha,\check\alpha)$ with two variables defined by Eq. (23) is differentiable, and its partial derivatives are given by
$$\frac{\partial h(\hat\alpha,\check\alpha)}{\partial\hat\alpha} = -\varepsilon\mathbf{1} - (F\odot K)(\hat\alpha - \check\alpha) + y, \qquad \frac{\partial h(\hat\alpha,\check\alpha)}{\partial\check\alpha} = -\varepsilon\mathbf{1} - (F\odot K)(\check\alpha - \hat\alpha) - y. \qquad (25)$$
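The partial derivatives in Eq. (25) translate directly into code; an illustrative sketch (with $\hat\alpha$ and $\check\alpha$ as plain arrays `a_hat` and `a_check`):

```python
import numpy as np

def svr_gradients(a_hat, a_check, F, K, y, eps):
    """Partial derivatives of h with respect to (a_hat, a_check), Eq. (25)."""
    FK_diff = (F * K) @ (a_hat - a_check)
    grad_hat = -eps * np.ones_like(y) - FK_diff + y     # d h / d a_hat
    grad_check = -eps * np.ones_like(y) + FK_diff - y   # d h / d a_check
    return grad_hat, grad_check
```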

Formally, $h(\hat\alpha,\check\alpha)$ is proven to be gradient-Lipschitz continuous by the following theorem.

Theorem 11 The function $h(\hat\alpha,\check\alpha)$ with partial derivatives given in Eq. (25) is gradient-Lipschitz continuous with the Lipschitz constant $L = 2\big(n + \frac{9nC^2\|K\|_{\mathrm{F}}^2}{4\eta}\big)$. That is, for any $\hat\alpha_1, \check\alpha_1, \hat\alpha_2, \check\alpha_2\in\mathcal{A}$, let the concatenated vectors be $\boldsymbol{\alpha}_1 = [\hat\alpha_1^\top, \check\alpha_1^\top]^\top$ and $\boldsymbol{\alpha}_2 = [\hat\alpha_2^\top, \check\alpha_2^\top]^\top$, and the partial derivatives be
$$\nabla_{\boldsymbol{\alpha}_1} h(\hat\alpha_1,\check\alpha_1) = \Big[\Big(\frac{\partial h(\hat\alpha,\check\alpha_1)}{\partial\hat\alpha}\Big|_{\hat\alpha=\hat\alpha_1}\Big)^{\!\top},\ \Big(\frac{\partial h(\hat\alpha_1,\check\alpha)}{\partial\check\alpha}\Big|_{\check\alpha=\check\alpha_1}\Big)^{\!\top}\Big]^\top,$$
$$\nabla_{\boldsymbol{\alpha}_2} h(\hat\alpha_2,\check\alpha_2) = \Big[\Big(\frac{\partial h(\hat\alpha,\check\alpha_2)}{\partial\hat\alpha}\Big|_{\hat\alpha=\hat\alpha_2}\Big)^{\!\top},\ \Big(\frac{\partial h(\hat\alpha_2,\check\alpha)}{\partial\check\alpha}\Big|_{\check\alpha=\check\alpha_2}\Big)^{\!\top}\Big]^\top;$$
then we have
$$\|\nabla_{\boldsymbol{\alpha}_1} h(\hat\alpha_1,\check\alpha_1) - \nabla_{\boldsymbol{\alpha}_2} h(\hat\alpha_2,\check\alpha_2)\|_2 \le 2L\big(\|\hat\alpha_2 - \hat\alpha_1\|_2 + \|\check\alpha_2 - \check\alpha_1\|_2\big).$$


Proof The proofs can be found in Appendix A.3.

Based on the gradient-Lipschitz continuity of $h(\hat\alpha,\check\alpha)$ established in Theorem 11, we are ready to present the first-order Nesterov's smooth optimization method for problem (22), summarized in Algorithm 2.

Algorithm 2: Projected gradient method with Nesterov's acceleration for problem (22)
Input: The kernel matrix $K$, the label matrix $Y$, and the Lipschitz constant $L = 2\big(n + \frac{9nC^2\|K\|_{\mathrm{F}}^2}{4\eta}\big)$
Output: The optimal $\alpha^*$
1  Set the stopping criteria $t_{\max} = 2000$ and $\varepsilon = 10^{-4}$.
2  Initialize $t = 0$ and $\hat\alpha^{(0)} := \check\alpha^{(0)} := \mathbf{0}\in\mathcal{A}$.
3  Repeat
4    Compute $F(\hat\alpha^{(t)},\check\alpha^{(t)}) = \mathcal{J}_{\frac{\tau}{2}}\big(\mathbf{1}\mathbf{1}^\top + \Gamma(\hat\alpha^{(t)},\check\alpha^{(t)})\big)$;
5    Compute $\partial h/\partial\hat\alpha$ and $\partial h/\partial\check\alpha$ by Eq. (25), and concatenate them as $\nabla h(\hat\alpha^{(t)},\check\alpha^{(t)}) = [(\partial h/\partial\hat\alpha)^\top, (\partial h/\partial\check\alpha)^\top]^\top$;
6    Compute $\theta^{(t)} = \mathcal{P}_{\mathcal{A}}\big([\hat\alpha^{(t)\top}, \check\alpha^{(t)\top}]^\top + \frac{1}{2L}\nabla h(\hat\alpha^{(t)},\check\alpha^{(t)})\big)$;
7    Compute $\beta^{(t)} = \mathcal{P}_{\mathcal{A}}\big([\hat\alpha^{(0)\top}, \check\alpha^{(0)\top}]^\top - \frac{1}{4L}\sum_{i=0}^t (i+1)\nabla h(\hat\alpha^{(i)},\check\alpha^{(i)})\big)$;
8    Set $[\hat\alpha^{(t+1)\top}, \check\alpha^{(t+1)\top}]^\top = \frac{t+1}{t+3}\theta^{(t)} + \frac{2}{t+3}\beta^{(t)}$;
9    Set $\alpha^{(t+1)} = \hat\alpha^{(t+1)} - \check\alpha^{(t+1)}$ and $t := t+1$;
10 Until $t\ge t_{\max}$ or $\|\alpha^{(t)} - \alpha^{(t-1)}\|_2\le\varepsilon$;

6. Experimental Results

This section evaluates the performance of our DANK model in comparison with several representative kernel learning algorithms on classification and regression benchmark datasets. All the experiments are implemented in MATLAB and conducted on a workstation with an Intel Xeon E5-2695 CPU (2.30 GHz) and 64 GB RAM. The source code of the proposed DANK method will be made publicly available.

6.1 Classification tasks

We conduct experiments on small-scale datasets from the UCI Machine Learning Repository³ and on three large datasets including EEG, ijcnn1 and covtype⁴. Besides, the CIFAR-10 database⁵ for image classification is used to evaluate the compared algorithms.

3. https://archive.ics.uci.edu/ml/datasets.html
4. All datasets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
5. https://www.cs.toronto.edu/~kriz/cifar.html


Table 1: Comparison results in terms of classification accuracy (mean ± std. deviation, %) on the UCI datasets. The best performance is highlighted in bold. The classification accuracy on the training data is reported for reference and does not participate in the ranking. Notation "•" indicates that the data-adaptive algorithm (KNPL or DANK) is significantly better than the other representative kernel learning methods via a paired t-test at the 5% significance level.

Dataset (d, n)       | LogDet Test | BMKL Test | RF Test  | SVM-CV Training | SVM-CV Test | KNPL Test | DANK Training | DANK Test
diabetic (19, 1151)  | 78.7±1.8    | 74.9±0.4  | 72.3±0.8 | 80.7±3.9        | 73.0±1.7    | 81.9±1.7• | 87.0±1.9      | 81.2±1.4•
heart (13, 270)      | 80.6±3.5    | 85.6±0.8  | 79.1±2.4 | 88.9±3.0        | 81.9±2.4    | 87.4±3.9• | 94.3±1.7      | 87.9±2.9•
monks1 (6, 124)      | 86.6±0.2    | 78.9±2.5  | 84.4±0.9 | 90.3±0.0        | 81.4±0.0    | 83.3±3.3  | 100.0±0.0     | 83.6±1.5
monks2 (6, 169)      | 85.9±1.2    | 82.1±1.3  | 73.6±1.1 | 100.0±0.0       | 85.8±1.4    | 83.3±1.6  | 100.0±0.0     | 86.7±0.9
monks3 (6, 122)      | 94.0±1.3    | 94.0±1.0  | 93.7±0.6 | 96.2±1.5        | 93.0±1.2    | 88.7±1.2  | 97.2±1.8      | 93.0±0.9
sonar (60, 208)      | 84.1±2.2    | 84.8±0.6  | 80.5±3.1 | 99.9±0.3        | 85.3±3.1    | 85.8±2.8  | 100.0±0.0     | 87.0±2.7•
spect (21, 80)       | 79.6±3.7    | 78.8±0.8  | 76.0±2.7 | 87.0±3.7        | 73.1±3.2    | 79.7±4.8  | 93.4±4.1      | 78.9±3.2
glass (9, 214)       | 71.2±1.0    | 68.2±4.6  | 68.2±2.2 | 77.1±6.9        | 69.8±2.0    | 72.4±2.3  | 89.7±6.1      | 74.5±1.3•
fertility (9, 100)   | 86.2±1.1    | 84.4±1.6  | 84.4±4.3 | 94.4±5.3        | 85.2±1.7    | 85.6±3.8  | 97.3±3.3      | 87.6±2.3•
wine (13, 178)       | 96.1±1.8    | 95.0±2.8  | 95.1±1.1 | 99.5±1.0        | 94.7±1.5    | 96.1±2.0  | 99.5±1.0      | 96.4±2.0

6.1.1 CLASSIFICATION RESULTS ON UCI DATABASE

Ten datasets from the UCI database are used to evaluate our DANK model embedded in SVM. Here we describe the experimental settings and the compared algorithms as follows.
Experimental Settings: Table 1 lists a brief description of these ten datasets, including the number of training data n and the feature dimension d. After normalizing the data to [0, 1]^d by a min-max scaler, we randomly pick half of the data for training and the rest for test, except for monks1, monks2, and monks3; in these three datasets, both training and test data have been provided. The Gaussian kernel k(x_i, x_j) = exp(−‖x_i − x_j‖²₂/(2σ²)) is chosen as the initial kernel in our model. The kernel width σ and the balance parameter C are tuned by 5-fold cross validation on a grid of points, i.e., σ ∈ {2⁻⁵, 2⁻⁴, …, 2⁵} and C ∈ {2⁻⁵, 2⁻⁴, …, 2⁵}. We experimentally set the penalty parameter τ to 0.01. The regularization parameter η is fixed to ‖α‖²₂ obtained by SVM. The experiments are conducted 10 times on these ten datasets.
Compared Methods: We include the following kernel learning based algorithms:

• BMKL (Gönen, 2012): A multiple kernel learning algorithm that uses a Bayesian approach to combine Gaussian kernels with ten different kernel widths and polynomial kernels with three different degrees.

• LogDet (Jain et al., 2012): A nonparametric kernel learning approach that aims to learn a PSD matrix W in a learned kernel φ(x)⊤Wφ(x′) with the LogDet divergence.

• RF (Sinha and Duchi, 2016): A kernel alignment based learning framework that creates randomized features, and then solves a simple optimization problem to select a subset. Finally, the kernel is learned from the optimized features by target alignment.

• KNPL (Liu et al., 2018): This nonparametric kernel learning framework is given by our conference version, which shares the initial ideas about learning in a data-adaptive scheme


Table 2: Dataset statistics and parameter settings on several large datasets.

datasets | d  | #training | #test   | C  | 1/(2σ²) | #clusters
EEG      | 14 | 7,490     | 7,490   | 32 | 100     | 5
ijcnn1   | 22 | 49,990    | 91,701  | 32 | 2       | 50
covtype  | 54 | 464,810   | 116,202 | 32 | 32      | 200

Table 3: Comparison of test accuracy and training time of all the compared algorithms on several large datasets. The rank of F in our DANK model is also given in parentheses.

Dataset | Metric      | SVM-SMO exact | LogDet exact | LogDet scalable | BMKL exact | BMKL scalable | RF exact | DANK exact (rank) | DANK scalable (rank)
EEG     | acc. (%)    | 95.9          | 95.9         | 95.2            | 94.5       | 94.1          | 80.1     | 96.7 (36)         | 96.3 (142)
EEG     | time (sec.) | 8.6           | 211.6        | 24.6            | 1426.3     | 124.5         | 2.4      | 473.6             | 39.8
ijcnn1  | acc. (%)    | 96.5          | 97.9         | 97.4            | 98.5       | 97.7          | 93.0     | 98.9 (342)        | 98.4 (1788)
ijcnn1  | time (sec.) | 112.4         | 8472.3       | 78.1            | 109967     | 1916.9        | 25.0     | 28548             | 571.7
covtype | acc. (%)    | 96.1          | ×¹           | 96.4            | ×          | 91.2          | 79.1     | ×                 | 97.1
covtype | time (sec.) | 3972.5        | ×            | 5021.4          | ×          | 94632         | 364.5    | ×                 | 7534.2
¹ These methods try to directly solve the optimization problem on covtype but fail due to the memory limit.

via a pair-wise way. But this work does not consider the bounded constraint and the low rank structure on F, and utilizes an alternating iterative algorithm to solve the corresponding semi-definite programming.

• SVM-CV: The SVM classifier with cross validation serves as a baseline.

Experimental Results: Table 1 summarizes the test results of each compared algorithm averaged over ten trials. Furthermore, we also apply the paired t-test at the 5% significance level to investigate whether the data-adaptive approaches (KNPL and DANK) are significantly better than the other methods. Specifically, we also present the classification accuracy of our DANK model and SVM-CV on the training data to show their respective model flexibilities.
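A sketch of this evaluation protocol (min-max scaling, a 5-fold grid search over σ and C for the SVM baseline, ten random splits, and the paired t-test) is given below. The DANK training step itself is not reproduced; the fixed-parameter SVM is only a stand-in so that the t-test call is runnable.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

sigmas = 2.0 ** np.arange(-5, 6)
param_grid = {"gamma": 1.0 / (2 * sigmas ** 2),       # k(x,x') = exp(-||x-x'||^2 / (2 sigma^2))
              "C": 2.0 ** np.arange(-5, 6)}

def split_accuracy(estimator, X, y, seed):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, random_state=seed)
    scaler = MinMaxScaler().fit(X_tr)                  # normalize features to [0, 1]
    return estimator.fit(scaler.transform(X_tr), y_tr).score(scaler.transform(X_te), y_te)

X, y = make_classification(n_samples=300, n_features=13, random_state=0)   # synthetic stand-in data
svm_cv = [split_accuracy(GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5), X, y, s)
          for s in range(10)]
baseline = [split_accuracy(SVC(kernel="rbf", C=1.0, gamma=1.0), X, y, s) for s in range(10)]
t_stat, p_value = ttest_rel(svm_cv, baseline)          # paired t-test over the ten trials
print(f"mean acc: {np.mean(svm_cv):.3f} vs {np.mean(baseline):.3f}, p = {p_value:.3f}")
```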

Compared with the baseline SVM-CV, the proposed DANK model significantly improves the model flexibility on diabetic, heart, monks1, spect, and glass in terms of the training accuracy. Accordingly, it is helpful for our model to achieve noticeable improvements on the test data. On the monks2, sonar, and wine datasets, SVM-CV has already obtained nearly 100% accuracy on the training data, which indicates that the model flexibility is sufficient. In this case, it is difficult for our DANK method to achieve a huge improvement on these datasets, and accordingly the performance margins are about 0%∼2%. Compared with other representative kernel based algorithms including LogDet, BMKL, and RF, the proposed DANK model yields favorable performance.

In general, the improvements in classification accuracy demonstrate the effectiveness of the learned adaptive matrix, and accordingly our model adapts well to both the training and test data.

6.1.2 RESULTS ON LARGE-SCALE DATASETS

To evaluate our decomposition scheme in large-scale situations, we consider three large datasets, EEG, ijcnn1, and covtype, to assess the kernel approximation performance.


Figure 3: Test classification accuracy (%) of the compared algorithms on the CIFAR-10 dataset (DANK 92.68, KNPL 91.72, VGG16 90.87, SVM-CV 90.35, RF 87.56, BMKL 90.55, LogDet 91.47).

Table 2 reports the dataset statistics (i.e., the feature dimension d, the number of training samples, and the number of test data) and the parameter settings, including the balance parameter C, the kernel width σ, and the number of clusters. Table 3 presents the test accuracy and training time of various compared algorithms, including LogDet, BMKL, our DANK method, SVM-SMO (Platt, 1998) (the cache is set to 5000), and RF, conducted in the following two settings.

In the first setting ("exact"), we attempt to directly test these algorithms over the entire training data. Experimental results indicate that, without the decomposition-based scalable approach, our DANK method achieves the best test accuracy with 96.7% and 98.9% on EEG and ijcnn1, respectively. However, under this setting, LogDet, BMKL, and our method are only feasible on EEG and ijcnn1; except for SVM-SMO and RF, all of them fail on the extremely large covtype dataset due to the memory limit.

In the second setting ("scalable"), we incorporate the kernel approximation scheme into LogDet, BMKL, and DANK, evaluated on the three large datasets. Experimental results show that such a decomposition-based scalable approach speeds up the above three methods. For example, when compared with the direct solution of the optimization problem in the "exact" setting, LogDet, BMKL, and DANK equipped with kernel approximation are sped up by about 100x, 50x, and 50x on ijcnn1, respectively. More importantly, on these three datasets, our DANK method using the approximation scheme still performs better than SVM-SMO in test accuracy, which demonstrates the effectiveness of the proposed flexible kernel learning framework.
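The following condensed sketch illustrates the decomposition idea behind the "scalable" setting: partition the training data, solve an independent subproblem per cluster, and route each test point to its cluster. Plain k-means and standard per-cluster SVMs are used here as stand-ins for the kernel k-means partition and the per-block DANK subproblems.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def fit_by_clusters(X, y, v=5, C=32.0, gamma=2.0):
    km = KMeans(n_clusters=v, n_init=10, random_state=0).fit(X)
    models = {}
    for c in range(v):
        idx = np.where(km.labels_ == c)[0]
        if len(np.unique(y[idx])) < 2:              # degenerate cluster: remember its single label
            models[c] = int(y[idx][0])
        else:                                       # per-cluster subproblem (stand-in for DANK)
            models[c] = SVC(kernel="rbf", C=C, gamma=gamma).fit(X[idx], y[idx])
    return km, models

def predict_by_clusters(km, models, X_new):
    cluster = km.predict(X_new)                     # route each test point to its cluster
    out = np.empty(len(X_new), dtype=int)
    for c, m in models.items():
        mask = cluster == c
        if mask.any():
            out[mask] = m if isinstance(m, int) else m.predict(X_new[mask])
    return out

X, y = make_classification(n_samples=600, n_features=22, random_state=1)
km, models = fit_by_clusters(X, y, v=5)
print("training accuracy:", (predict_by_clusters(km, models, X) == y).mean())
```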

From the results in the above two settings, we see that our DANK method achieves promising test accuracy no matter whether the kernel approximation scheme is incorporated or not. Moreover, such an approximation scheme makes BMKL, LogDet, and our DANK method feasible for large datasets with a huge speedup in terms of computational efficiency.

6.1.3 RESULTS ON CIFAR-10 DATASET

In this section, we test our model on a representative dataset, CIFAR-10 (Krizhevsky and Hinton, 2009), for the natural image classification task. This dataset contains 60,000 color images of size 32×32×3 in 10 categories, of which 50,000 images are used for training and the rest for testing. In our experiment, each color image is represented by the feature extracted from a convolutional neural network, i.e., VGG16 with batch normalization (Ioffe and Szegedy, 2015) pre-trained on ImageNet (Deng et al., 2009). Then we fine-tune this pre-trained VGG16 model on the CIFAR-10 dataset with 240 epochs and a mini-batch size of 64. The learning rate starts from 0.1 and then is


Figure 4: Comparison between the projected gradient method and our Nesterov's acceleration on the heart dataset. The horizontal axis is the number of iterations. The vertical axis is the objective function value H(α, F) in (a) and the dual variable difference ‖α^(t) − α^(t−1)‖₂ in (b).

divided by 10 at the 120-th, 160-th, and 200-th epochs. After that, for each image, a 4096-dimensional feature vector is obtained from the output of the first fully-connected layer in this fine-tuned neural network.
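A sketch of this feature-extraction step with torchvision is shown below; loading the fine-tuned checkpoint and the CIFAR-10 preprocessing pipeline are omitted, and the random input is only there to illustrate the tensor shapes.

```python
import torch
import torchvision

model = torchvision.models.vgg16_bn(weights=None)       # load the fine-tuned weights here
model.eval()

def extract_4096(images):
    """Return the output of the first fully-connected layer of VGG16-BN."""
    with torch.no_grad():
        x = model.features(images)                      # convolutional trunk
        x = model.avgpool(x)
        x = torch.flatten(x, 1)                         # (batch, 25088)
        return model.classifier[0](x)                   # first FC layer -> (batch, 4096)

feats = extract_4096(torch.randn(2, 3, 224, 224))       # dummy batch just for the shapes
print(feats.shape)                                      # torch.Size([2, 4096])
```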

Figure 3 shows the test accuracy (mean±std. deviation %) of the compared algorithms averaged over ten trials. The original VGG16 model with the softmax classifier achieves 90.87±0.54% test accuracy. Using the extracted 4096-D feature vectors, the kernel learning based algorithms equipped with the initial Gaussian kernel are tuned by 5-fold cross validation on a grid of points, i.e., σ ∈ {0.001, 0.01, 0.1, 1} and C ∈ {1, 10, 20, 30, 40, 50, 80, 100}. In terms of classification performance, SVM-CV, BMKL, LogDet, and KNPL obtain 90.35±0.18%, 90.55±0.58%, 91.47±0.44%, and 91.72±0.52% accuracy on the test data, respectively. Comparably, our DANK model achieves a promising classification accuracy of 92.68±0.41%. More importantly, our DANK model outperforms SVM-CV with an accuracy margin of 2.33%, and is statistically significantly better than the other methods via a paired t-test at the 5% significance level. The improvement over SVM-CV on the test accuracy demonstrates that our DANK method equipped with the introduced kernel adjustment strategy is able to enhance the model flexibility, and thus achieves good performance.

6.2 Analysis and Validation for Theoretical Results

6.2.1 CONVERGENCE EXPERIMENTS

To investigate the effectiveness of the Nesterov's smooth optimization method, we conduct a convergence experiment on the heart dataset, comparing it with a baseline, i.e., the standard projected gradient method.

In Figure 4(a), we plot the objective function value H(α, F) versus iteration for the standard projected gradient method (blue dashed line) and its Nesterov's acceleration (red solid line), respectively. One can see that the developed first-order Nesterov's smooth optimization method converges faster than the projected gradient method, so the feasibility of employing Nesterov's acceleration for solving problem (5) is verified. Besides, to further illustrate the convergence of {α^(t)}_{t=0}^∞, we plot ‖α^(t) − α^(t−1)‖₂ versus iteration in Figure 4(b). We find that the sequence


Figure 5: Influence of the low rank constraint on test classification accuracy (DANK vs. DANK without the low rank constraint); the vertical axis is classification accuracy (%).

of differences ‖α^(t) − α^(t−1)‖₂ yielded by the Nesterov's acceleration algorithm decays significantly within the first 500 iterations, which leads to quick convergence to an optimal solution. Hence, compared with the projected gradient method, the Nesterov's smooth optimization method is able to efficiently solve the targeted convex optimization problem in this paper.
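The behavior in Figure 4 can be reproduced qualitatively on a toy problem. The sketch below compares plain projected gradient ascent with a Nesterov-accelerated (FISTA-style) variant on a box-constrained concave quadratic that mimics the structure of the dual objective; it is not the DANK objective itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 200, 1.0
A = rng.standard_normal((n, n))
Q = A @ A.T / n                                   # PSD curvature matrix
L = np.linalg.eigvalsh(Q).max()                   # gradient-Lipschitz constant
grad = lambda a: 1.0 - Q @ a
proj = lambda a: np.clip(a, 0.0, C)               # projection onto the box [0, C]^n
obj = lambda a: a.sum() - 0.5 * a @ Q @ a

# plain projected gradient ascent
a = np.zeros(n)
for _ in range(300):
    a = proj(a + grad(a) / L)
print("PGA objective:", obj(a))

# Nesterov-accelerated (FISTA-style) projected ascent
a_prev = a_cur = np.zeros(n)
t_prev = 1.0
for _ in range(300):
    t_cur = (1 + np.sqrt(1 + 4 * t_prev ** 2)) / 2
    z = a_cur + (t_prev - 1) / t_cur * (a_cur - a_prev)     # momentum extrapolation
    a_prev, a_cur, t_prev = a_cur, proj(z + grad(z) / L), t_cur
print("Accelerated objective:", obj(a_cur))
```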

6.2.2 VALIDATION FOR THE LOW RANK CONSTRAINT

On large datasets, we do not consider the low rank regularizer on F due to its inseparable nature, which hinders efficient optimization. Here we experimentally examine the influence of ‖F‖∗ on the test classification accuracy for both small and large datasets.

For the small-sample case, we choose the ten datasets from the UCI database that appeared in Section 6.1.1 to verify the effectiveness of ‖F‖∗. Figure 5 shows that, in terms of the test accuracy, apart from the monks1, monks2, and wine datasets, our method without the low rank regularizer loses about 1%∼2% accuracy on the remaining datasets. This indicates that the low rank constraint is important on small-sample datasets, and here we attempt to explain this issue. Without the low rank constraint, each entry in the learned adaptive matrix F can vary arbitrarily in the solution space F. As a result, our model has the O(n²) capability of "scattering" the n training data, which would lead to over-fitting. Besides, the learned F might be overly complicated, and therefore it is not easily extended to F′ for test data by the simple nearest neighbor scheme.

For the large-sample case, according to Table 3, our DANK model with the "exact" solution achieves 96.7% and 98.9% accuracy on the EEG and ijcnn1 datasets, respectively. In contrast, after omitting the low rank regularizer ‖F‖∗, the test accuracy of our model in the "scalable" setting decreases to 96.3% and 98.4%, respectively. More importantly, we find that, without the low rank constraint, the rank of F increases from 36 (in the "exact" setting) to 142 (in the "scalable" setting) on the EEG dataset with 7,490 training samples. This tendency is also exhibited on the ijcnn1 dataset with 49,990 training samples, where the rank of F increases from rank(F) = 342 to rank(F) = 1788. The above results indicate that dropping the low rank constraint leads to a slight decrease in test accuracy, while the rank of F indeed rises but is still much smaller than n. Regarding the low rank effect of F, on the one hand, in our experiment, the regularization parameter


(a) ‖α* − ᾱ‖²₂ (b) ‖F* − F̄‖_F (c) non-support vectors

Figure 6: Experimental demonstration of the derived bounds in Theorems 7 and 8.

η is chosen as η := ‖α‖²₂ ∈ O(n). This is an extremely large value on large-sample datasets, resulting in a strong regularization term η‖F − 11⊤‖_F. Therefore, F varies around the all-one matrix in a small bounded region and thus shows a low rank effect to some extent. On the other hand, the Gaussian kernel used on large datasets often inherits rapidly decaying spectra (Smola and Schölkopf, 2000), e.g., the exponential decay λ_i ∝ e^{−ci} with c > 0 as illustrated by Bach (2013). Accordingly, the Gaussian kernel in our DANK model exhibits exponential decay with only a few significant eigenvalues, i.e., a low rank structure, which helps to obtain a relatively low rank kernel matrix F ⊙ K due to the rank inequality rank(F ⊙ K) ≤ rank(F) rank(K). Based on the above analysis, in the large-sample case, the bounded constraint η‖F − 11⊤‖_F could be an alternative choice for the low rank property if we have to drop ‖F‖∗. This is also demonstrated by our experimental results on the EEG and ijcnn1 datasets: when we discard the low rank constraint, the rank of F indeed increases, but the obtained approximate solution F̄ still exhibits the low rank property when compared to the large number of training samples.
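The two effects discussed above can be checked numerically, as sketched below on synthetic data: the spectrum of a Gaussian kernel matrix on low-dimensional inputs decays quickly, and a matrix confined to a small ball around 11⊤ is dominated by a single singular value. The data, kernel width, and ball radius are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))                    # low-dimensional data so the decay is visible
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                                # Gaussian kernel with sigma = 1

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print("fraction of the spectrum in the top 20 eigenvalues:",
      round(eigs[:20].sum() / eigs.sum(), 4))

# a matrix confined to a small ball around 11^T has one dominant singular value
F = np.ones((500, 500)) + 1e-3 * rng.standard_normal((500, 500))
s = np.linalg.svd(F, compute_uv=False)
print("leading singular values of F:", np.round(s[:4], 3))
print("effective rank (s_i > 1e-2 * s_0):", int((s > 1e-2 * s[0]).sum()))
```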

From the above observations and analyses, we conclude that the low rank constraint in our model is important for small-sample cases. On large-sample datasets, due to the separability requirement, our DANK model has to drop the low rank constraint, which leads to a slight decrease in the final classification accuracy. Nevertheless, the employed bounded constraint, which acts as a strong regularizer in large-sample cases, restricts F to a small bounded region around the all-one matrix, and thus serves as an alternative way to seek a low rank matrix F.

6.2.3 VALIDATION FOR OUR DERIVED BOUNDS

Here we show that the derived bounds in Theorems 7 and 8 are tight in practice. To verify the results of Theorem 7, we randomly select 2,000 samples from the ijcnn1 dataset, and consider different numbers of clusters v = 2, 5, 10. For each v, the kernel k-means outputs the data partition V1, V2, …, Vv, and we then obtain the solving result ‖α* − ᾱ‖²₂ (in blue) and the derived bound B₂C²Q(π)/(B₁‖K‖_F) + 2BC²/B₁ (in red), as shown in Figure 6(a). We can find that the derived bound is close to the solving result; in particular, it is far smaller than the trivial bound ‖α* − ᾱ‖²₂ ≤ nC² = 2.048 × 10⁶.

Similarly, in Figure 6(b), we present the solving result ‖F* − F̄‖_F (in blue), our derived bound in Theorem 7 (in red), and the trivial bound n max{√B, B} (in black), respectively. It can be found that these three results are close in an absolute sense. Thus, the two panels demonstrate that the data


(a) accuracy (b) time cost

Figure 7: Classification accuracy and time cost of various out-of-sample extension based algorithms.

partition scheme can lead to a good approximation of the exact solution (α*, F*), and our derived bounds in Theorem 7 are tight in practical situations.

In Figure 6(c), we validate that the derived condition in Theorem 8 is useful for screening non-support vectors. In this experiment, we also randomly select 2,000 samples from the ijcnn1 dataset, set v = 10, and obtain the approximate solution (ᾱ, F̄). We compute the gradient coordinate (∇_αH(ᾱ, F̄))_i associated with each sample and plot the support vectors (in green) and non-support vectors (in blue) as shown in Figure 6(c). Using the condition (21) in Theorem 8, the samples whose gradient satisfies (∇_αH(ᾱ, F̄))_i + (B+B₂)C(‖K_i‖₁ + κ) ≤ 0, where (B+B₂)C(‖K_i‖₁ + κ) = 442.3 in this experiment, can be directly recognized as non-support vectors (marked in red) without solving the optimization problem. We find that, out of #SVs = 214 (the number of support vectors) and #non-SVs = 1786 (the number of non-support vectors), there are 723 non-support vectors recognized by our Theorem 8. That means over 40% of the non-support vectors can be picked out, which demonstrates the effectiveness of our derived bound/condition in Theorem 8.
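A sketch of this screening step is given below: compute the gradient coordinate of every sample under the approximate solution and mark those violating the threshold as non-support vectors. The matrices F̄ and ᾱ and the constants B, B₂, κ below are random placeholders; only the form of the threshold (B+B₂)C(‖K_i‖₁ + κ) follows Theorem 8.

```python
import numpy as np

def screen_non_svs(K, y, alpha_bar, F_bar, B, B2, C, kappa):
    """Indices i with (grad_alpha H(alpha_bar, F_bar))_i + (B + B2) * C * (||K_i||_1 + kappa) <= 0."""
    grad = 1.0 - y * ((F_bar * K) @ (y * alpha_bar))         # gradient coordinates, cf. Eq. (26)
    margin = (B + B2) * C * (np.abs(K).sum(axis=0) + kappa)  # per-sample threshold
    return np.where(grad + margin <= 0.0)[0]

# toy usage with random placeholder quantities
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
y = rng.choice([-1.0, 1.0], size=200)
alpha_bar = rng.uniform(0.0, 1.0, size=200)
F_bar = np.ones((200, 200))
screened = screen_non_svs(K, y, alpha_bar, F_bar, B=0.01, B2=0.01, C=1.0, kappa=1.0)
print("samples screened out as non-support vectors:", len(screened))
```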

6.2.4 VALIDATION FOR OUT-OF-SAMPLE EXTENSIONS

Here we attempt to validate the effectiveness of the proposed reciprocal nearest neighbor (NN) scheme. The compared algorithms include the k-NN strategy with k = 1 and k = 5. In addition, the current state-of-the-art method for out-of-sample extension (Pan et al., 2017) is taken into consideration. It formulates the extension as a regression problem in a hyper-RKHS spanned by a reproducing hyper-kernel (Ong et al., 2005), i.e., a regression function is learned from the nonparametric kernel matrix K by the developed nonnegative least-squares in this space. Since each element in the hyper-RKHS is a kernel function, the learned regression function is also a kernel as expected, achieving the goal of out-of-sample extension.

Figure 7(a) shows the test classification accuracy under different numbers of training data (n = 100, 200, 300, 400, 500) from the ijcnn1 dataset. It can be observed that our reciprocal nearest neighbor scheme outperforms the standard k-NN strategy but is inferior to (Pan et al., 2017). We can also find that when fewer data are taken for training (e.g., 100), the nearest neighbor scheme leads to a large variance. Comparably, our reciprocal nearest neighbor scheme is more robust and thus performs stably in the sample-scarcity case. In terms of time cost, Figure 7(b) shows that these three


(a) parameter sensitivity of τ (b) parameter sensitivity of η := xC²

Figure 8: Parameter sensitivity analysis of τ and η.

nearest neighbor schemes achieve similar computational efficiency, and are more efficient than (Pan et al., 2017). Admittedly, (Pan et al., 2017) achieves better classification accuracy to some extent, but it is time-consuming when compared to our nearest neighbor based scheme. This is because learning in the hyper-RKHS often requires O(n⁴) complexity, whereas the reciprocal nearest neighbor scheme only needs O(n²) for out-of-sample extension. Hence, our scheme is simple but effective, achieving a good trade-off between accuracy and time cost.
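For completeness, the reciprocal nearest-neighbor relation itself (i and j are reciprocal k-NNs if each lies in the other's k-nearest-neighbor set, following Qin et al., 2011) can be computed as sketched below; how the matched training pair's entry of F is then transferred to a test pair is kept abstract here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reciprocal_knn(X, k=5):
    """Boolean matrix R with R[i, j] = True iff i and j are reciprocal k-nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)        # +1 because each point is its own NN
    idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop the point itself
    in_knn = np.zeros((len(X), len(X)), dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    in_knn[rows, idx.ravel()] = True
    return in_knn & in_knn.T                               # symmetric reciprocal relation

X = np.random.default_rng(0).standard_normal((100, 6))
R = reciprocal_knn(X, k=5)
print("number of reciprocal pairs:", int(R.sum() // 2))
```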

6.2.5 PARAMETER SENSITIVITY ANALYSIS

The objective function (5) in our DANK model contains two regularization parameters τ and η. In our experiments, we empirically set τ = 0.01 and choose η = ‖α‖²₂, where α is obtained by SVM. Here we discuss whether these choices significantly influence the performance of the proposed DANK model. Figure 8(a) shows the test accuracy of our algorithm under different τ values {0.001, 0.01, 0.1, 0.5, 1} on the diabetic, heart, monks1, monks2, and monks3 datasets. The classification results clearly indicate that our DANK model is robust to variations of τ in a wide range, so it can be easily tuned, and we directly set τ = 0.01 in practice.

Moreover, we investigate whether the output of our DANK model is sensitive to variations of η := {0.001, 0.01, 0.1, 1, 10, 100}·C² on the above five datasets. Our method with η := ‖α‖²₂ serves as a baseline (black dashed line) in Figure 8(b). It can be found that the test accuracy peaks at different η values on these five datasets, some of which even exceed our adaptive choice ‖α‖²₂. Nevertheless, we can set η = 0.1C² as an alternative choice if we have no prior on α.

6.3 Regression

This section focuses on the proposed DANK model embedded in SVR for regression tasks. We first conduct experiments on several synthetic datasets to examine the performance of our method on recovering 1-D and 2-D test functions. Then, eight regression datasets⁶ are used to test our model and other representative regression algorithms. The evaluation metric used here is the relative mean square error (RMSE) between the learned regression function g(x) and the target label y over

6. The datasets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html


n data points:

RMSE = Σ_{i=1}^n (g(x_i) − y_i)² / Σ_{i=1}^n (y_i − E(y))².

6.3.1 SYNTHETIC DATA

Here we test the approximation performance of our method on 1-D and 2-D test functions, compared with a baseline, the SVR with Gaussian kernel. The representative 1-D step function is defined by

g(s, w, a, x) = ( tanh(ax/w − a⌊x/w⌋ − a/2) / (2 tanh(a/2)) + 1/2 + ⌊x/w⌋ ) s,

where s is the step height, w is the period, and a controls the smoothness of the function g. In our experiment, s, w, and a are set to 3, 2, and 0.05, respectively. We plot the step function on [−5, 5] as shown in Figure 9(a). One can see that the approximation function generated by SVR-CV (blue dashed line) yields a larger deviation than that of our DANK model (red solid line). To be specific, the RMSE of SVR is 0.013, while our DANK model achieves a promising approximation error with a value of 0.004.


(a) recovering a step function (b) a 2-D function

(c) recovering by SVR-CV (d) recovering by DANK

Figure 9: Approximation for 1-D and 2-D functions by SVR-CV and our DANK model.

Apart from the 1-D function, we use a 2-D test function to evaluate SVR-CV and the proposed DANK model. The 2-D test function (Cherkassky et al., 1996), defined on (u, v) ∈ [−0.5, 0.5] × [−0.5, 0.5], is given by

g(u, v) = 42.659 ( 0.1 + (u − 0.5)(g1(u, v) + 0.05) ),


Table 4: Comparison results of various methods on UCI datasets in terms of RMSE (mean±std. deviation %). The best performance is highlighted in bold. The RMSE on the training data is presented in italic, and does not participate in ranking. Notation "•" indicates that the method is significantly better than the other baseline methods via a paired t-test.

Dataset (d, n)       | BMKL Test    | NW Test      | SVR-CV Training | SVR-CV Test | DANK Training | DANK Test
bodyfat (14, 252)    | 0.101±0.065  | 0.097±0.010  | 0.007±0.000     | 0.101±0.002 | 0.008±0.000   | 0.091±0.013
pyrim (27, 74)       | 0.512±0.115  | 0.604±0.115  | 0.013±0.010     | 0.678±0.255 | 0.007±0.004   | 0.457±0.143•
space (6, 3107)      | 0.248±0.107  | 0.308±0.004  | 0.226±0.144     | 0.261±0.112 | 0.106±0.067   | 0.202±0.114•
triazines (60, 186)  | 0.725±0.124  | 0.885±0.039  | 0.113±0.084     | 0.815±0.206 | 0.009±0.012   | 0.734±0.128
cpusmall (12, 8912)  | 0.133±0.004  | 0.037±0.009  | 0.037±0.004     | 0.104±0.002 | 0.001±0.000   | 0.114±0.003
housing (13, 506)    | 0.286±0.027  | 0.177±0.015• | 0.085±0.049     | 0.267±0.023 | 0.069±0.013   | 0.218±0.013
mg (6, 1385)         | 0.297±0.013  | 0.294±0.010  | 0.163±0.076     | 0.407±0.118 | 0.134±0.055   | 0.303±0.058
mpg (7, 392)         | 0.187±0.014  | 0.182±0.012  | 0.095±0.087     | 0.193±0.006 | 0.061±0.010   | 0.173±0.008•

where g1(u, v) is defined by

g1(u, v) = (u − 0.5)⁴ − 10(u − 0.5)²(v − 0.5)² + 5(v − 0.5)⁴.
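For reference, the two synthetic targets and the relative mean square error can be transcribed directly from the formulas above; the sketch below uses the parameter values stated in the text (s = 3, w = 2, a = 0.05).

```python
import numpy as np

def step_function(x, s=3.0, w=2.0, a=0.05):
    """1-D step target g(s, w, a, x) as defined above."""
    r = x / w
    return s * (np.tanh(a * r - a * np.floor(r) - a / 2) / (2 * np.tanh(a / 2))
                + 0.5 + np.floor(r))

def g2d(u, v):
    """2-D test function of Cherkassky et al. (1996) as defined above."""
    g1 = (u - 0.5) ** 4 - 10 * (u - 0.5) ** 2 * (v - 0.5) ** 2 + 5 * (v - 0.5) ** 4
    return 42.659 * (0.1 + (u - 0.5) * (g1 + 0.05))

def rmse(pred, target):
    """Relative mean square error between predictions and targets."""
    return np.sum((pred - target) ** 2) / np.sum((target - target.mean()) ** 2)

x = np.linspace(-5, 5, 400)
u, v = np.meshgrid(np.linspace(-0.5, 0.5, 20), np.linspace(-0.5, 0.5, 20))
print(step_function(x).min(), step_function(x).max(), g2d(u, v).shape)
```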

We uniformly sample 400 data points from g(u, v) as shown in Figure 9(b), and then use SVR-CV and DANK to learn a regression function from the sampled data. The regression results of SVR-CV and our DANK model are shown in Figure 9(c) and Figure 9(d), respectively. Intuitively, our method recovers the upward warp of the original function more faithfully than SVR-CV. In terms of RMSE, the error of our method for regressing the test function is 0.007, which is lower than that of SVR-CV with 0.042.

6.3.2 REGRESSION RESULTS ON UCI DATASETS

For real-world situations, we compare the proposed DANK model with other representative regression algorithms on eight datasets from the UCI database. Gönen (2012) extends BMKL to regression tasks, and thus we include it for comparison. Apart from SVR-CV and BMKL, the Nadaraya–Watson (NW) estimator with metric learning (Noh et al., 2017) is also taken into comparison. The remaining experimental settings follow those of the classification tasks on the UCI database described in Section 6.1.1.

Table 4 lists brief statistics of these eight datasets, and reports the average prediction error and standard deviation of each compared algorithm. Moreover, we present the regression performance of our DANK model and SVR-CV on the training data to show their respective model flexibilities. We find that the proposed DANK model exhibits more encouraging performance than SVR-CV in terms of the RMSE on the training and test data. From the results on four datasets, namely bodyfat, pyrim, space, and mpg, we observe that our method statistically achieves the best prediction performance. On the remaining datasets, our DANK model achieves comparable performance with BMKL and NW.

7. Conclusion

In this work, we propose an effective data-adaptive strategy to enhance the model flexibility of nonparametric kernel learning based algorithms. In our DANK model, a multiplicative scale factor for each entry of the Gram matrix can be learned from the data, leading to an improved data-adaptive kernel. As a result of this data-driven scheme, the model flexibility of our DANK model embedded in SVM (for classification) and SVR (for regression) is demonstrated by the experiments on synthetic and real datasets. Specifically, the model can be learned in a one-stage process with an O(1/t²) convergence rate thanks to the studied gradient-Lipschitz continuity, where t is the iteration number. That means we can simultaneously train the SVM or SVR and optimize the data-adaptive matrix by a projected gradient method with Nesterov's acceleration. In addition, we develop a decomposition-based scalable approach to make our DANK model feasible for large datasets, whose effectiveness has been verified by both experimental results and theoretical guarantees.

A. Proofs for gradient-Lipschitz continuity

A.1 Proofs of Lemma 3

Proof  Since J_{τ/2} is non-expansive (Ma et al., 2011), i.e., for any Ω1 and Ω2,

‖J_{τ/2}(Ω1) − J_{τ/2}(Ω2)‖_F ≤ ‖Ω1 − Ω2‖_F,

with Ω1 = 11⊤ + Γ(α1) and Ω2 = 11⊤ + Γ(α2). Thereby, we have

‖F(α1) − F(α2)‖_F ≤ ‖Γ(α1) − Γ(α2)‖_F
  = (1/4η) ‖diag(α1⊤Y) K diag(α1⊤Y) − diag(α2⊤Y) K diag(α2⊤Y)‖_F
  = (1/4η) ‖diag(α1⊤Y + α2⊤Y) K diag(α1⊤Y − α2⊤Y)‖_F
  ≤ (1/4η) ‖K‖_F ‖diag(α1⊤Y + α2⊤Y)‖_F ‖diag(α1⊤Y − α2⊤Y)‖_F
  ≤ (‖K‖_F/4η) ‖α1 + α2‖₂ ‖α1 − α2‖₂,

which yields the desired result.

A.2 Proofs of Theorem 4

Proof  The gradient of h(α) in Eq. (6) is computed as

∇h(α) = 1 − Y(F(α) ⊙ K)Yα.   (26)

It is obvious that F(α) is unique over the compact set F. Hence, according to the differentiability of the optimal value function (Penot, 2004), for any α1, α2 ∈ A, from the representation of ∇h(α) in Eq. (26), the function h(α) is proven to be gradient-Lipschitz continuous:

‖∇h(α1) − ∇h(α2)‖₂ = ‖Y(F1 ⊙ K)Yα1 − Y(F2 ⊙ K)Yα2‖₂
  = ‖Y((F1 − F2) ⊙ K)Yα1 − Y(F2 ⊙ K)Y(α2 − α1)‖₂
  ≤ ‖Y((F1 − F2) ⊙ K)Yα1‖₂ + ‖Y(F2 ⊙ K)Y(α2 − α1)‖₂
  ≤ ‖(F1 − F2) ⊙ K‖_F ‖α1‖₂ + ‖F2 ⊙ K‖_F ‖α1 − α2‖₂
  ≤ ‖F1 − F2‖_F ‖K‖_F ‖α1‖₂ + ‖F2‖_F ‖K‖_F ‖α1 − α2‖₂
  ≤ (‖K‖²_F/4η) ‖α1‖₂ (‖α1‖₂ + ‖α2‖₂) ‖α1 − α2‖₂ + λ_max(F2) ‖K‖_F ‖α1 − α2‖₂
  ≤ (‖K‖²_F/4η) ‖α1‖₂ (‖α1‖₂ + ‖α2‖₂) ‖α1 − α2‖₂ + (n + (nC²/4η) λ_max(K)) ‖K‖_F ‖α1 − α2‖₂
  ≤ (n + 3nC²‖K‖²_F/(4η)) ‖α1 − α2‖₂,   (27)

where we use the triangle inequality, the relation ‖A ⊙ B‖_F ≤ Tr(AB⊤) ≤ ‖A‖_F‖B‖_F, Lemma 3 together with F2 ∈ B, Lemma 1 to bound λ_max(F2), and finally 0 ≤ α ≤ C1, ‖α‖²₂ ≤ nC², and λ_max(K) ≤ ‖K‖_F. The Lipschitz constant is L = n + 3nC²‖K‖²_F/(4η). Hence, we conclude the proof.

A.3 Proofs of Theorem 11

Proof  For any α1, α̂1, α2, α̂2 ∈ A, from the representation of ∇_α h(α, α̂) in Proposition 10, we have

‖∂h(α, α̂1)/∂α |_{α=α1} − ∂h(α, α̂2)/∂α |_{α=α2}‖₂
  = ‖((F1 − F2) ⊙ K)(α1 − α̂1) − (F2 ⊙ K)(α2 − α̂2 − α1 + α̂1)‖₂
  ≤ ‖F1 − F2‖_F ‖K‖_F ‖α1 − α̂1‖₂ + ‖F2‖_F ‖K‖_F ‖α2 − α̂2 − α1 + α̂1‖₂
  ≤ ((‖K‖²_F/4η) ‖α1 − α̂1‖₂ ‖α2 − α̂2 + α1 − α̂1‖₂ + λ_max(F2) ‖K‖_F) ‖α2 − α̂2 − α1 + α̂1‖₂
  ≤ (8nC²‖K‖²_F/(4η) + n + (nC²/(4η)) λ_max(K) ‖K‖_F) (‖α2 − α1‖₂ + ‖α̂2 − α̂1‖₂)
  ≤ (n + 9nC²‖K‖²_F/(4η)) (‖α2 − α1‖₂ + ‖α̂2 − α̂1‖₂).   (28)

Similarly, ‖∂h(α1, α̂)/∂α̂ |_{α̂=α̂1} − ∂h(α2, α̂)/∂α̂ |_{α̂=α̂2}‖₂ can also be bounded. Combining these two inequalities, we have

‖∇h(α1, α̂1) − ∇h(α2, α̂2)‖₂ ≤ ‖∂h(α, α̂1)/∂α |_{α=α1} − ∂h(α, α̂2)/∂α |_{α=α2}‖₂ + ‖∂h(α1, α̂)/∂α̂ |_{α̂=α̂1} − ∂h(α2, α̂)/∂α̂ |_{α̂=α̂2}‖₂
  ≤ 2L (‖α2 − α1‖₂ + ‖α̂2 − α̂1‖₂).   (29)

Thus, we complete the proof.

B. Proofs in large scale situations

B.1 Proofs of Lemma 5

Proof  The quadratic term with respect to α in Eq. (17) can be decomposed into

α⊤Y(F ⊙ K)Yα = Σ_{c=1}^v α^(c)⊤ Y^(c,c) (F^(c,c) ⊙ K^(c,c)) Y^(c,c) α^(c),

and ‖F − 11⊤‖²_F can be expressed as

‖F − 11⊤‖²_F = Σ_{i,j=1}^n (Fij − 1)² = Σ_{c=1}^v ‖F^(c,c) − 11⊤‖²_F + Const,

where the constant is the sum over the non-block-diagonal elements of F and does not affect the solution of Eq. (19). Specifically, the positive semi-definiteness of F^(c,c) with c = 1, 2, …, v still guarantees that the whole matrix F is PSD. Besides, the constraint in Eq. (19) is also decomposable, so the subproblems are separable and independent. As a result, the concatenation of their optimal solutions yields the optimal solution of Eq. (19).

B.2 Proofs of Theorem 6

Proof  We use H̄(α, F) to denote the objective function in Eq. (19) with the block-diagonal Gram matrix K̄. Hence, the approximate solution (ᾱ, F̄) is a saddle point of H̄(α, F) due to the max-min problem in Eq. (19). It is easy to check that H̄(α, F̄) ≤ H̄(ᾱ, F̄) ≤ H̄(ᾱ, F) for any feasible α and F.

Comparing H(α*, F*) with H̄(α*, F*), we can easily obtain

H̄(α*, F*) − H(α*, F*) = (1/2) Σ_{i,j: π(xi)≠π(xj)} α*i α*j yi yj F*ij Kij,   (30)

with F*ij := F*(xi, xj). Similarly, we have

H̄(ᾱ, F*) − H(ᾱ, F*) = (1/2) Σ_{i,j: π(xi)≠π(xj)} ᾱi ᾱj yi yj F*ij Kij.

Therefore, combining the above equations, the upper bound of H(α*, F*) − H̄(ᾱ, F̄) can be derived by

H(α*, F*) ≤ H̄(ᾱ, F*) − (1/2) Σ_{i,j: π(xi)≠π(xj)} α*i α*j yi yj F*ij Kij
          = H(ᾱ, F*) − (1/2) Σ_{i,j: π(xi)≠π(xj)} (α*i α*j − ᾱi ᾱj) yi yj F*ij Kij
          ≤ H̄(ᾱ, F̄) − (1/2) Σ_{i,j: π(xi)≠π(xj)} (α*i α*j − ᾱi ᾱj) yi yj F*ij Kij
          ≤ H̄(ᾱ, F̄) + (1/2) B C² Q(π),

where the first inequality holds by H̄(α*, F*) ≤ H̄(ᾱ, F*) and Eq. (30), and the second inequality holds by H(ᾱ, F*) ≤ H̄(ᾱ, F̄). Similarly, H(α*, F*) − H̄(ᾱ, F̄) is lower bounded by

H(α*, F*) = H̄(α*, F*) − (1/2) Σ_{i,j: π(xi)≠π(xj)} α*i α*j yi yj F*ij Kij
          ≥ H̄(α*, F̄) − (1/2) Σ_{i,j: π(xi)≠π(xj)} α*i α*j yi yj F*ij Kij
          = H(α*, F̄) + (1/2) Σ_{i,j: π(xi)≠π(xj)} α*i α*j yi yj (F̄ij − F*ij) Kij
          ≥ H̄(ᾱ, F̄) + (1/2) Σ_{i,j: π(xi)≠π(xj)} α*i α*j yi yj (F̄ij − F*ij) Kij
          ≥ H̄(ᾱ, F̄) − (1/2) B C² Q(π),

which concludes the proof that |H(α*, F*) − H̄(ᾱ, F̄)| ≤ (1/2) B C² Q(π).

B.3 Proofs of Theorem 7

Proof  We first derive the error bound with respect to α, and then bound ‖F* − F̄‖_F. We denote ᾱ = α* + Δα and F̄ = F* + ΔF. Only considering α, we use an equivalent form of Eq. (17), that is

max_α min_{F∈S₊ⁿ}  1⊤α − (1/2) α⊤Y(F ⊙ K)Yα
s.t.  0 ≤ α ≤ C1,  ‖F − 11⊤‖²_F ≤ R²,   (31)

which is demonstrated by Eq. (4). The optimality condition of α in Eq. (31) is

(∇_α H(α*, F*))_i  = 0 if 0 < α*i < C;  ≤ 0 if α*i = 0;  ≥ 0 if α*i = C,   (32)

where ∇_α H(α*, F*) = 1 − Y(F* ⊙ K)Yα*. Since ᾱ is a feasible solution satisfying 0 ≤ ᾱi ≤ C, we have (Δα)i ≥ 0 if α*i = 0 and (Δα)i ≤ 0 if α*i = C. Accordingly, the following inequality holds:

(Δα)⊤(1 − Y(F* ⊙ K)Yα*) = Σ_{i=1}^n (Δα)i (∇_α H(α*, F*))_i ≤ 0.   (33)

Next, according to Eq. (33), H(ᾱ, F̄) in Eq. (31) can be decomposed into

H(ᾱ, F̄) = H(α*, F*) − (1/2)(Δα)⊤Y(F̄ ⊙ K)Y(Δα) + 1⊤(Δα) − (α*)⊤Y(F* ⊙ K)Y(Δα) − (1/2)(α*)⊤Y(ΔF ⊙ K)Yα* − (α*)⊤Y(ΔF ⊙ K)Y(Δα)
  ≤ H(α*, F*) − (1/2)(Δα)⊤Y(F̄ ⊙ K)Y(Δα) − (α*)⊤Y(ΔF ⊙ K)Y(Δα).

Further, the above inequality indicates that

(1/2)(Δα)⊤Y(F* ⊙ K)Y(Δα) ≤ H(α*, F*) − H(ᾱ, F̄) − (1/2)(Δα)⊤Y(F̄ ⊙ K)Y(Δα)
  ≤ (1/2) B₂ C² Q(π) + B C² ‖K‖_F,

which concludes the bound

‖α* − ᾱ‖²₂ ≤ B₂C²Q(π)/(B₁‖K‖_F) + 2BC²/B₁.   (34)

Next our target is to bound ‖F* − F̄‖_F. Defining

F*(α*) = J_{τ/2}(11⊤ + Γ*(α*)),  with Γ*(α*) := (1/4η) diag((α*)⊤Y) K diag((α*)⊤Y),
F̄(ᾱ) = J_{τ/2}(11⊤ + Γ̄(ᾱ)),  with Γ̄(ᾱ) := (1/4η) diag(ᾱ⊤Y) K̄ diag(ᾱ⊤Y),

we have

‖F*(α*) − F̄(ᾱ)‖_F ≤ ‖Γ*(α*) − Γ̄(ᾱ)‖_F ≤ ‖Γ*(α*) − Γ*(ᾱ)‖_F + ‖Γ*(ᾱ) − Γ̄(ᾱ)‖_F,   (35)

with Γ*(α) = (1/4η) diag(α⊤Y) K diag(α⊤Y). The first term in Eq. (35) can be bounded by Lemma 3. Combining it with the derived bound on ‖α* − ᾱ‖₂ in Eq. (34), we have

‖Γ*(α*) − Γ*(ᾱ)‖_F ≤ ‖K‖_F ‖α* + ᾱ‖₂ ‖α* − ᾱ‖₂ ≤ ‖K‖_F √(nB₂Q(π)/(B₁‖K‖_F) + 2nB/B₁) C².   (36)

The second term in Eq. (35) is bounded as

‖Γ*(ᾱ) − Γ̄(ᾱ)‖_F = (1/4η) ‖diag(ᾱ⊤Y)(K − K̄) diag(ᾱ⊤Y)‖_F ≤ (1/4η) Σ_{i,j: π(xi)≠π(xj)} ᾱi ᾱj yi yj Kij ≤ (1/4η) C² Q(π).   (37)

Combining Eqs. (35), (36), and (37), we conclude the proof.

B.4 Proofs of Theorem 8

Proof  We decompose (∇_α H(α*, F*))_i into

(∇_α H(α*, F*))_i = (∇_α H(ᾱ, F̄))_i − (Y(F* ⊙ K)Y Δα)_i − (Y(ΔF ⊙ K)Y α*)_i + Σ_{j: π(xi)≠π(xj)} ᾱj yi yj F*ij Kij + Σ_{j: π(xi)≠π(xj)} α*j yi yj (ΔF)ij Kij
  ≤ (∇_α H(ᾱ, F̄))_i + B₂κC + (B₂ − B₁)κC + B₂C‖Ki‖₁ + (B₂ − B₁)C‖Ki‖₁
  ≤ (∇_α H(ᾱ, F̄))_i + (B + B₂)C‖Ki‖₁ + (B + B₂)κC
  ≤ (∇_α H(ᾱ, F̄))_i + (B + B₂)C(‖Ki‖₁ + κ),

where Ki denotes the i-th column of the kernel matrix K. If the right-hand side of the above inequality is smaller than zero, then (∇_α H(α*, F*))_i < 0, and from the optimality condition of problem (17) in Eq. (32) we can conclude that α*i = 0.


References

Andreas Argyriou, Charles A. Micchelli, and Massimiliano Pontil. Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the International Conference on Computational Learning Theory, pages 338–352. Springer, 2005.

Francis Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the Conference on Learning Theory, pages 185–209, 2013.

Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.

Francis R. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Proceedings of Advances in Neural Information Processing Systems, pages 105–112, 2008.

Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the International Conference on Machine Learning, pages 1–8, 2004.

Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.

Yoshua Bengio, Jean-François Paiement, and Pascal Vincent. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Proceedings of Advances in Neural Information Processing Systems, pages 177–184, 2004.

Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. Hilbert space embeddings of predictive state representations. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 92–101, 2013.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

Vladimir Cherkassky, Don Gehring, and Filip Mulier. Comparison of adaptive methods for function estimation from samples. IEEE Transactions on Neural Networks, 7(4):969–984, 1996.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(2):795–828, 2012.

Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In Proceedings of Advances in Neural Information Processing Systems, pages 367–373, 2002.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551–556. ACM, 2004.


Harris Drucker, Christopher J. C. Burges, Linda Kaufman, Alex J. Smola, and Vladimir Vapnik. Support vector regression machines. In Proceedings of Advances in Neural Information Processing Systems, pages 155–161, 1997.

Michael Fanuel, Antoine Aspeel, Jean-Charles Delvennes, and Johan A. K. Suykens. Positive semi-definite embedding for dimensionality reduction and out-of-sample extensions. arXiv preprint arXiv:1711.07271, 2017.

Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed k-means. Mathematical Statistics and Learning, 1(3):317–374, 2019.

Mehmet Gönen. Bayesian efficient multiple kernel learning. In Proceedings of the International Conference on Machine Learning, pages 1–8, 2012.

Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.

Steven C. H. Hoi, Rong Jin, and Michael R. Lyu. Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the International Conference on Machine Learning, pages 361–368, 2007.

Cho-Jui Hsieh, Si Si, and Inderjit Dhillon. A divide-and-conquer solver for kernel support vector machines. In Proceedings of the International Conference on Machine Learning, pages 566–574, 2014.

Xiaolin Huang, Andreas Maier, Joachim Hornegger, and Johan A. K. Suykens. Indefinite kernels in least squares support vector machines and principal component analysis. Applied and Computational Harmonic Analysis, 43(1):162–172, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pages 448–456, 2015.

Prateek Jain, Brian Kulis, Jason V. Davis, and Inderjit S. Dhillon. Metric and kernel learning using a linear transformation. Journal of Machine Learning Research, 13(1):519–547, 2012.

Pratik Jawanpuria, Jagarlapudi Saketha Nath, and Ganesh Ramakrishnan. Generalized hierarchical kernel learning. Journal of Machine Learning Research, 16(1):617–652, 2015.

S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7(Jul):1493–1515, 2006.

Matthäus Kleindessner and Ulrike von Luxburg. Kernel functions based on triplet comparisons. In Proceedings of Advances in Neural Information Processing Systems, pages 6810–6820, 2017.

Vladimir Koltchinskii, Dmitry Panchenko, et al. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.

Nils M. Kriege, Pierre-Louis Giscard, and Richard C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Proceedings of Advances in Neural Information Processing Systems, pages 1623–1631, 2016.


Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Brian Kulis, Matyas A. Sustik, and Inderjit S. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10(Feb):341–376, 2009.

Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5(1):27–72, 2004.

Heng Lian and Zengyan Fan. Divide-and-conquer for debiased ℓ1-norm support vector machine in ultra-high dimensions. Journal of Machine Learning Research, 18(1):6691–6716, 2017.

Fanghui Liu, Xiaolin Huang, Chen Gong, Jie Yang, and Li Li. Nonlinear pairwise layer and its training for kernel learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3659–3666, 2018.

Gaëlle Loosli, Stéphane Canu, and Cheng Soon Ong. Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1204–1216, 2016.

Zhengdong Lu, Prateek Jain, and Inderjit S. Dhillon. Geometry-aware metric learning. In Proceedings of the International Conference on Machine Learning, pages 673–680, 2009.

Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321–353, 2011.

Subhransu Maji, Alexander C. Berg, and Jitendra Malik. Efficient classification for additive kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):66–77, 2013.

Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh. Causal inference via kernel deviance measures. In Proceedings of Advances in Neural Information Processing Systems, pages 6986–6994, 2018.

Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Yung-Kyun Noh, Masashi Sugiyama, Kee-Eung Kim, Frank Park, and Daniel D. Lee. Generative local metric learning for kernel regression. In Proceedings of Advances in Neural Information Processing Systems, pages 2449–2459, 2017.

Cheng Soon Ong, Xavier Mary, and Alexander J. Smola. Learning with non-positive kernels. In Proceedings of the International Conference on Machine Learning, pages 81–89, 2004.

Cheng Soon Ong, Alexander J. Smola, and Robert C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6(Jul):1043–1071, 2005.


Binbin Pan, Wen Sheng Chen, Bo Chen, Chen Xu, and Jianhuang Lai. Out-of-sample extensions for non-parametric kernel methods. IEEE Transactions on Neural Networks and Learning Systems, 28(2):334–345, 2017.

Jean-Paul Penot. Differentiability properties of optimal value functions. Canadian Journal of Mathematics, 56(4):825–842, 2004.

John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. In Advances in Kernel Methods—Support Vector Learning, pages 212–223, 1998.

Wei Qian, Yuqian Zhang, and Yudong Chen. Structures of spurious local minima in k-means. arXiv preprint arXiv:2002.06694, 2020.

Danfeng Qin, Stephan Gammeter, Lukas Bossard, and Till Quack. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 777–784, 2011.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

Frank Michael Schleif and Peter Tino. Indefinite proximity learning: A review. Neural Computation, 27(10):2039–2096, 2015.

Bernhard Schölkopf and Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2003.

John Shawe-Taylor and Nello Cristianini. Support Vector Machines, volume 2. Cambridge University Press, Cambridge, 2000.

Qinfeng Shi, Chunhua Shen, Rhys Hill, and Anton Van Den Hengel. Is margin preserved after random projection? In Proceedings of the International Conference on Machine Learning, pages 643–650, 2012.

Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. Memory efficient kernel approximation. Journal of Machine Learning Research, 18:1–32, 2017.

Aman Sinha and John C. Duchi. Learning kernels with random features. In Proceedings of Advances in Neural Information Processing Systems, pages 1298–1306, 2016.

Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning, pages 911–918, 2000.

Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In Proceedings of the International Conference on Computational Learning Theory, pages 169–183. Springer, 2006.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science and Business Media, 2008.


George P. H. Styan. Hadamard products and multivariate statistical analysis. Linear Algebra and Its Applications, 6:217–240, 1973.

Philip H. S. Torr. Locally linear support vector machines. In Proceedings of the International Conference on Machine Learning, pages 1–8, 2011.

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In Proceedings of the International Conference on Machine Learning, pages 1065–1072, 2009.

John von Neumann. On rings of operators. Reduction theory. Annals of Mathematics, pages 401–485, 1949.

Shusen Wang, Luo Luo, and Zhihua Zhang. SPSD matrix approximation via column selection: Theories, algorithms, and extensions. Journal of Machine Learning Research, 17(1):1697–1745, 2016.

Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the International Conference on Machine Learning, pages 1067–1075, 2013.

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Proceedings of the Conference on Learning Theory, pages 592–617, 2013.

Wei-Shi Zheng, Shaogang Gong, and Tao Xiang. Reidentification by relative distance comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(3):653–668, 2012.

Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1318–1327, 2017.

Jinfeng Zhuang, Ivor W. Tsang, and Steven C. H. Hoi. A family of simple non-parametric kernel learning algorithms. Journal of Machine Learning Research, 12(2):1313–1347, 2011.
