
CAKE: Convex Adaptive Kernel Density Estimation

Ravi Ganti and Alexander Gray
College of Computing (CSE), Georgia Tech

Abstract

In this paper we present a generalization of kernel density estimation called Convex Adaptive Kernel Density Estimation (CAKE) that replaces single bandwidth selection by a convex aggregation of kernels at all scales, where the convex aggregation is allowed to vary from one training point to another, treating the fundamental problem of heterogeneous smoothness in a novel way. Learning the CAKE estimator given a training set reduces to solving a single convex quadratic programming problem. We derive rates of convergence of CAKE-like estimators to the true underlying density under smoothness assumptions on the class and show that, given a sufficiently large sample, the mean squared error of such estimators is optimal in a minimax sense. We also give a risk bound for the CAKE estimator in terms of its empirical risk. We empirically compare CAKE to other density estimators proposed in the statistics literature for handling heterogeneous smoothness on different synthetic and natural distributions.

1 Introduction

The problem of density estimation is as follows: given $n$ i.i.d. points $x_1, \dots, x_n$ sampled from a distribution with density function $f$, the task is to construct a density estimator $\hat f : \mathbb{R}^d \times (\mathbb{R}^d)^n \to \mathbb{R}$ which provably converges to the true underlying density function $f$ in a suitable sense. Accurate density estimation allows one to build accurate classifiers and regressors and also facilitates data visualization. Parametric approaches to density estimation, e.g. fitting a mixture of Gaussians

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

to the data with the expectation-maximization algorithm (Bishop et al., 2006) require strong parametric assumptions on the true underlying density function which are seldom known. A variety of non-parametric approaches for density estimation, such as kernel density estimation (KDE) and wavelet-based methods, exist. Non-parametric methods require only the weaker assumption that the underlying function we are trying to recover belongs to a smooth class of functions, and hence are suitable in many domains where no knowledge of an appropriate parametric form is available.

Kernel density estimation (KDE) (Parzen, 1962) is the most popular non-parametric method for density estimation, in part because other approaches such as wavelets do not extend well beyond one or two dimensions. KDE involves fitting a smoothing kernel, which is a symmetric probability density function (PDF), at each of the training points. The density at a point $x$ is then simply the sum of the kernel contributions of all training points $x_i$ at $x$. This yields an estimator of the form

$$\hat f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right). \qquad (1)$$

Here $k$ is a smoothing kernel that is chosen a priori. Examples of smoothing kernels include the Gaussian and Epanechnikov kernels. The task in KDE is to estimate the bandwidth $h$. Some common approaches to estimating $h$ include maximizing the leave-one-out likelihood cross-validation score and minimizing the least squares cross-validation error (LSCV) (an excellent survey of bandwidth selection methods can be found in (Jones et al., 1996), (Hall et al., 1995)). Though kernel density estimators are consistent (Tsybakov, 2009), they are not good at modeling distributions which have spatially varying smoothness. This affects the problem visually and also leads to slower rates of asymptotic convergence. In this paper we present a generalization of KDE called Convex Adaptive Kernel Density Estimation (CAKE). The basic idea of CAKE is to use a set of base kernels with different bandwidths and to fit kernels at different training points by a convex aggregation (CA) of these base kernels. However, the trick is to allow these CAs to vary across the different training points.

This turns out to be equivalent to fitting a kernel at each training point, with the bandwidth being a function of the coefficients of the CA, the training point, and the test point where we need to estimate the density. By doing so we are able to learn a density estimator that adapts well to varying levels of smoothness of the true density.
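For concreteness, here is a minimal sketch (Python with NumPy; function names are ours, and this is not the authors' MLPACK implementation) of the fixed-bandwidth estimator in Equation (1) with a Gaussian kernel, the baseline that CAKE generalizes:

```python
import numpy as np

def gauss(u):
    # Gaussian smoothing kernel k(u) on R^d (a symmetric PDF).
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def kde(x, train, h):
    # Equation (1): f_hat(x) = 1/(n h^d) * sum_i k((x - x_i) / h),
    # with a single bandwidth h shared by every training point.
    n, d = train.shape
    u = (x[:, None, :] - train[None, :, :]) / h    # shape (num_eval, n, d)
    return gauss(u).sum(axis=1) / (n * h**d)
```

Every training point shares the single bandwidth $h$; that shared-bandwidth restriction is exactly what CAKE removes.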

Previous Work. One proposed technique for learning density functions with spatially varying levels of smoothness is to learn the optimal smoothing kernel by solving a variational problem that minimizes the variance or mean integrated squared error (MISE) of the density estimates via Legendre polynomials (Gasser et al., 1985). Such kernels are data independent and are difficult to generalize to higher dimensions. In the "variable kernel" density estimation method (VKDE) the bandwidth at each training point varies with the distance from that point to its $p$th nearest neighbour among the remaining $n-1$ training points. VKDE with smoothing parameter $h$ is defined as

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(h\, d_{p,i})^{d}}\, k\!\left(\frac{x - x_i}{h\, d_{p,i}}\right), \qquad (2)$$

where $d_{p,i}$ is the distance from $x_i$ to its $p$th nearest neighbour in the dataset, and $h$ is a universal smoothing factor. Estimators obtained by VKDE are probability density functions (PDFs) and inherit the smoothness properties of the kernel $k$, but they are still not very good at capturing complex distributions and require optimization over a continuous variable $h$ and a discrete variable $p$. Nearest neighbour KDE (NNKDE) fits kernels of bandwidth $h$ at all training points, where $h$ varies with the $p$th nearest neighbour of the test point $x$ where we need to estimate the density. NNKDE is known to exhibit rough tails (Silverman, 1986); the resulting estimates are discontinuous and do not necessarily integrate to 1. A generalization of the "variable kernel" method which is known to work well for 1-d problems is the adaptive kernel density estimation (AKDE) method (Breiman et al., 1977), where the width of the kernel varies across the training points. AKDE works by first choosing a bandwidth $h$ to get a pilot density estimate at the different training points, and then scaling the bandwidth at each training point by $\theta_i$, giving larger bandwidths at training points where the density is small and smaller bandwidths where the density is large. The estimator is

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(h \theta_i)^{d}}\, k\!\left(\frac{x - x_i}{h \theta_i}\right), \qquad (3)$$

where $\theta_i$ is the local bandwidth scaling factor. Girolami and He (Girolami & He, 2003) proposed the reduced set density estimator (RSDE), where the kernel contributions of the different training points are scaled by different weights and the density is estimated as

$$\hat f(x) = \sum_{i=1}^{n} \frac{\gamma_i}{h^{d}}\, k\!\left(\frac{x - x_i}{h}\right). \qquad (4)$$

The weights $\gamma_i$ are then learnt by solving a convex QP which minimizes the ISE under convexity constraints. Due to the structure of the optimization problem the vector $\gamma$ turns out to be sparse, and hence only the reduced set (the non-zero $\gamma_i$ values) matters. However, it is clear that for points that are far away from the reduced set the density estimate of RSDE is an underestimate.
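A small sketch of the AKDE idea in Equation (3) follows. The exact form of the scaling factors $\theta_i$ is not given in the text; the inverse power of a pilot density estimate, normalized by its geometric mean, is one conventional choice and is assumed here. Names are ours.

```python
import numpy as np

def gauss(u):
    # Gaussian smoothing kernel on R^d.
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def akde(x, train, h, alpha=0.5):
    # Adaptive KDE in the spirit of Equation (3): a fixed-bandwidth pilot estimate
    # at each training point gives per-point scaling factors theta_i that are
    # large where the pilot density is small and small where it is large.
    n, d = train.shape
    diff = train[:, None, :] - train[None, :, :]
    pilot = gauss(diff / h).sum(axis=1) / (n * h**d)      # pilot estimate at each x_i
    theta = (pilot / np.exp(np.mean(np.log(pilot)))) ** (-alpha)
    u = (x[:, None, :] - train[None, :, :]) / (h * theta)[None, :, None]
    return (gauss(u) / (h * theta) ** d).sum(axis=1) / n
```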

Support vector density estimation (Vapnik & Mukherjee, 1999) fits a CDF to the sampled data by solving an optimization problem which minimizes the $\ell_1$ distance between the estimator of the cumulative distribution function and its empirical counterpart. Work along similar lines has been done in (Song et al., 2008; Shawe-Taylor & Dolia, 2007). However, to our knowledge these methods lack theoretical results on the bias, variance and consistency of such estimators. Devroye and Lugosi (Devroye & Lugosi, 2000) investigated learning kernel density estimators in an $L_1$ framework. They proposed the double kernel method, where a pair of kernels $k, l$ is used to learn the bandwidth $h$, which is provably universal. Though promising, to our knowledge no empirical work has been done in this framework. Liu et al. proposed the RODEO density estimator (Liu et al., 2007)¹ to fit high dimensional distributions. They learn a semi-parametric density estimator where, in order to estimate the density at a point $x$, product kernels are fitted at the different training points; these product kernels are products of $d$ univariate kernels along the different dimensions with different bandwidths. The bandwidths along the different dimensions are learnt by using a test statistic which compares the magnitude of the derivative of the density estimate along different dimensions to the variance of the density estimate. The resulting estimator provably achieves better rates of convergence than KDE under certain sparsity assumptions.

Our Contributions. The problem of learning an optimal Mercer kernel has been of recent interest in the kernel machines community (Ong et al., 2005; Lanckriet et al., 2004). In this paper we bridge the two distinct areas of density estimation and kernel machines and show that by appropriately learning smoothing kernels for the problem of density estimation one can achieve the desired goal of learning density estimators that exhibit varying levels of smoothness in different regions of the space, so that even complex distributions can be modeled well. With this goal in mind we propose the Convex Adaptive Kernel Density Estimation (CAKE)

¹In this paper we shall concern ourselves with local RODEO with a uniform density as the baseline density and KDE as the non-parametric component of the density estimator.


Figure 1: This figure outlines the fundamental difference between the shapes of the kernels fitted by the different kernel-based density estimation methods: AKDE/VKDE (leftmost plot), RODEO (middle plot), and CAKE (rightmost plot).

method (Section 3). In CAKE a set of base kernels $\mathcal{K}$ ($|\mathcal{K}| = m = O(1)$) is used to learn smoothing kernels at the different training points by aggregating the base kernels in a convex way. However, the trick is to let these convex aggregations change from one training point to the other. The CAKE density estimator can be written as:

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\alpha_{ij}}{h_j^{d}}\, k\!\left(\frac{x - x_i}{h_j}\right) \qquad (5)$$

where $\alpha \in \mathbb{R}^{nm}$, $\alpha \ge 0$, $\sum_{j=1}^{m} \alpha_{ij} = 1$ for all $i = 1, \dots, n$, and $h_1, \dots, h_m$ are the bandwidths of the $m$ base kernels and are assumed to be known. The constants $\alpha_{ij}$

are learnt by minimizing the regularized LSCV score of $\hat f$. The LSCV (Section 2) of $\hat f$ is a surrogate of the integrated squared error (ISE) of $\hat f$ and can be calculated using the training set. Minimizing the regularized LSCV of the density estimator $\hat f$ reduces to solving a quadratic programming problem (QP) over $nm$ variables (Section 3), which can be efficiently solved using a simple variation of the SMO algorithm. The unique power of CAKE as a density estimator (see Figure (1)) stems from the fact that CAKE estimates densities by placing kernels of different bandwidths at different training points (like AKDE and VKDE). In addition, these bandwidths depend on the point $x$ where we need to estimate the density (like NNKDE). However, unlike AKDE and RODEO, the learning process involves minimization of a risk function, namely the $L_2$ distance between the estimator $\hat f$ and the true density function $f$. We also show connections to the literature on optimal aggregation of estimators (Nemirovski, 2000; Tsybakov, 2003) and demonstrate how CAKE is more than a simple convex aggregation of kernel density estimators.

We analyze the MSE of CAKE-like density estimators (Section (4)), examine its optimal value, and show that given enough data (as a function of the true underlying density and the dimensionality) the MSE of CAKE-like estimators for densities in the Holder class $\Sigma(\beta, L)$ is $O(n^{-\frac{2\beta}{2\beta+d}})$. We also provide a bound on the $L_1$ risk of CAKE in terms of its empirical $L_1$ error using stability arguments. To our knowledge this is the first time stability arguments have been used for density estimation problems (Section (4)).

We empirically compare our density estimator to RSDE, RODEO, AKDE and VKDE on various synthetic and natural datasets (Section (5)). In order to evaluate our density estimator on high dimensional data, we use the CAKE density estimator to learn a smoothing-kernel-based classifier and compare it to smoothing-kernel-based classifiers learnt using other density estimation methods.

Notation. Vectors are represented by lower case letters and matrices by upper case. We use double indexing for vectors of size $nm$: e.g. if $v \in \mathbb{R}^{nm}$ then $v_{pq}$ refers to the $(n(p-1)+q)$th element of the vector $v$. $\mathbf{1}_n$ refers to the vector of all 1's of size $n$, and $\mathbf{0}_m$ refers to the vector of all 0's of size $m$.

2 $L_2$ Error of a Density Estimator and its Surrogate

We would like to minimize the $L_2$ error of an estimator, also known as the integrated squared error (ISE). One of the prime motivations for choosing the ISE as our objective over other objectives such as likelihood is its robustness to outliers (Silverman, 1986). Given any estimator $\hat f$ of the underlying density function, the ISE of the estimator is

$$\mathrm{ISE}(\hat f) = \int \left(\hat f(x) - f(x)\right)^2 dx \qquad (6)$$

Since $\int f^2\,dx$ is independent of $\hat f$, minimizing $\mathrm{ISE}(\hat f)$ is equivalent to minimizing

$$\mathrm{LSCV}(\hat f) = \int \hat f^{\,2}\, dx - 2 \int \hat f f\, dx. \qquad (7)$$

Define $\hat f_{-i}(x)$ to be the density estimate at $x$ without taking into account the kernel contribution of the training point $x_i$, so that $\hat f_{-i}(x_i) = \frac{1}{(n-1) h^{d}} \sum_{j=1, j \ne i}^{n} k\!\left(\frac{x_i - x_j}{h}\right)$. We have

$$\mathrm{LSCV}(\hat f) = \int \hat f^{\,2}\, dx - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(x_i) \qquad (8)$$

Hence $\mathrm{LSCV}(\hat f)$ gives a data-dependent estimator of the ISE. By using a smoothing kernel function, say a Gaussian kernel with an unknown bandwidth $h$, one can cross-validate for the optimal bandwidth that gives the smallest value of the LSCV. A strong large-sample justification for using the ISE comes from Stone's result (Stone, 1984), which states that asymptotically minimizing $\mathrm{LSCV}(\hat f)$ is equivalent to minimizing $\int_{\mathbb{R}^d} (\hat f - f)^2\, dx$ over all $h$.
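A small sketch of Equations (7)-(8) for a fixed-bandwidth Gaussian KDE, assuming the standard closed form in which the integral of a product of two Gaussian kernels of bandwidth $h$ is a Gaussian kernel of bandwidth $\sqrt{2}\,h$ evaluated at $x_i - x_j$; the names and toy data are ours, not the authors' code.

```python
import numpy as np

def gauss(u):
    # Gaussian smoothing kernel on R^d.
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def lscv(train, h):
    # LSCV(f_hat) = int f_hat^2 dx - (2/n) sum_i f_hat_{-i}(x_i), Equations (7)-(8).
    n, d = train.shape
    diff = train[:, None, :] - train[None, :, :]
    # Closed form for the first term with Gaussian kernels: each cross term integrates
    # to a Gaussian kernel of bandwidth sqrt(2) * h evaluated at x_i - x_j.
    int_f2 = (gauss(diff / (np.sqrt(2.0) * h)) / (np.sqrt(2.0) * h) ** d).sum() / n**2
    loo = gauss(diff / h) / h**d
    np.fill_diagonal(loo, 0.0)                     # drop the i = j self-contribution
    return int_f2 - 2.0 * loo.sum() / (n * (n - 1))

# Cross-validating the bandwidth: pick h on a grid minimizing the LSCV score.
rng = np.random.default_rng(0)
data = rng.normal(size=(400, 1))
grid = np.linspace(0.05, 1.0, 40)
h_star = min(grid, key=lambda h: lscv(data, h))
```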

3 Convex Adaptive Kernel Density Estimation Method

The CAKE estimator uses a finite set of base kernels and fits a kernel at each training point that is a convex combination of the base kernels. However, this convex combination is allowed to vary from one training point to the other. Let $\mathcal{K}$ be a finite set of smoothing kernels with known bandwidths $h_1, \dots, h_m$, where $m = O(1)$ (these bandwidths could have been pre-learnt using the training dataset or could have been provided by an oracle). The CAKE density estimator can be written as

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\alpha_{ij}}{h_j^{d}}\, k\!\left(\frac{x - x_i}{h_j}\right). \qquad (9)$$

The problem now reduces to learning the weights $\alpha$. Our aim is to minimize the $L_2$-regularized LSCV of the density estimator $\hat f$. Using Equations (8, 9) we get

$$\mathrm{LSCV}(\hat f) \approx \int_{\mathbb{R}^d} \frac{1}{n^2} \left[\sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\alpha_{ij}}{h_j^{d}}\, k\!\left(\frac{x - x_i}{h_j}\right)\right]^2 dx \;-\; \frac{2}{n^2} \sum_{i=1}^{n} \sum_{\substack{p=1 \\ p \ne i}}^{n} \sum_{j=1}^{m} \frac{\alpha_{pj}}{h_j^{d}}\, k\!\left(\frac{x_i - x_p}{h_j}\right). \qquad (10)$$

Define $Z \in \mathbb{R}^{nm \times nm}$ and $\alpha, v \in \mathbb{R}^{nm \times 1}$ as

$$Z[ij, pl] = \int \frac{1}{n^2 h_j^{d} h_l^{d}}\, k\!\left(\frac{x - x_i}{h_j}\right) k\!\left(\frac{x - x_p}{h_l}\right) dx, \qquad v[ij] = \frac{1}{n^2} \sum_{\substack{p=1 \\ p \ne i}}^{n} \frac{1}{h_j^{d}}\, k\!\left(\frac{x_i - x_p}{h_j}\right). \qquad (11)$$

Hence, from Equations (10)-(11), minimizing the $L_2$-regularized LSCV can be cast as the following optimization problem:

$$P: \quad \min_{\alpha}\; \alpha^{T} Z \alpha - 2 \alpha^{T} v + \lambda \|\alpha\|_2^2 \qquad (12)$$

$$\text{subject to: } \sum_{j=1}^{m} \alpha_{ij} = 1 \;\; \forall i = 1, \dots, n, \qquad \alpha \ge 0 \qquad (13)$$

where the constraints (13) ensure that Equation (9) is indeed a legal density estimator. It is easy to see that $Z \succeq 0$, and hence the optimization problem (12)-(13) is a convex QP over $nm$ variables. In order to solve this QP efficiently we can reuse the standard SMO algorithm (Keerthi et al., 2001) with the constraint that both working-set variables should come from the same "block"² of the $\alpha$ vector over which the convexity constraints have been defined.
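The following is a sketch of how the QP (12)-(13) could be assembled and solved. It assumes Gaussian base kernels, so the integral in Equation (11) has the closed form $\int \mathcal{N}(x; x_i, h_j^2 I)\,\mathcal{N}(x; x_p, h_l^2 I)\,dx = \mathcal{N}(x_i - x_p;\, 0,\, (h_j^2 + h_l^2) I)$, and it uses projected gradient descent with a per-block simplex projection in place of the block-constrained SMO variant described above; all names are ours.

```python
import numpy as np

def gauss_density(diff, s2):
    # Multivariate normal density N(diff; 0, s2 * I), evaluated row-wise.
    d = diff.shape[-1]
    return np.exp(-0.5 * np.sum(diff**2, axis=-1) / s2) / (2.0 * np.pi * s2) ** (d / 2.0)

def project_simplex(y):
    # Euclidean projection of y onto the probability simplex {w >= 0, sum(w) = 1}.
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(y) + 1) > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - tau, 0.0)

def build_qp(train, bandwidths):
    # Z[ij, pl] and v[ij] from Equation (11), specialized to Gaussian base kernels:
    # the cross-kernel integral is a Gaussian of variance h_j^2 + h_l^2 at x_i - x_p.
    n, d = train.shape
    m = len(bandwidths)
    diff = train[:, None, :] - train[None, :, :]                  # (n, n, d)
    Z = np.zeros((n * m, n * m))
    v = np.zeros(n * m)
    for j, hj in enumerate(bandwidths):
        for l, hl in enumerate(bandwidths):
            Z[j::m, l::m] = gauss_density(diff, hj**2 + hl**2) / n**2
        kj = gauss_density(diff, hj**2)
        np.fill_diagonal(kj, 0.0)                                 # enforce p != i
        v[j::m] = kj.sum(axis=1) / n**2
    return Z, v

def solve_cake_qp(Z, v, n, m, lam, steps=2000):
    # Projected gradient descent on alpha^T Z alpha - 2 alpha^T v + lam ||alpha||^2,
    # re-projecting each training point's block of m weights onto the simplex (13).
    alpha = np.full(n * m, 1.0 / m)
    H = 2.0 * (Z + lam * np.eye(n * m))
    lr = 1.0 / np.linalg.norm(H, 2)                               # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        alpha = alpha - lr * (H @ alpha - 2.0 * v)
        for i in range(n):
            alpha[i * m:(i + 1) * m] = project_simplex(alpha[i * m:(i + 1) * m])
    return alpha
```

Once $\alpha$ is learnt, the density at a test point is evaluated with Equation (9); for problems of realistic size the SMO-style solver described above would be preferable to forming the dense $nm \times nm$ matrix built here.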

Relation to aggregation of density estimators. A closely related body of literature is that of optimal aggregation of estimators in least squares regression (Nemirovski, 2000; Tsybakov, 2003), where given $m$ regression estimators the task is to learn an optimal aggregation of these estimators w.r.t. a certain model defined by the $m$ estimators. Popular models include the convex hull, the linear span of the $m$ estimators, and the original set of $m$ estimators. The focus has been on designing estimators whose excess risk w.r.t. the optimal aggregation in the model is small. Analogously, Rigollet and Tsybakov (Rigollet & Tsybakov, 2007) have investigated optimal aggregation of density estimators. In this work the authors learn the best linear/convex combination of given base density estimators that minimizes the expected ISE. As an example they consider the case where the $m$ density estimators are all kernel density estimators with Gaussian kernels of different bandwidths. Now, if we place additional restrictions on the $\alpha_{ij}$'s so that $\alpha_{ij} = \gamma_j$ for all $i, j$ and $\sum_{j=1}^{m} \gamma_j = 1$, then the optimization problem proposed in Equations (12)-(13), along with these additional restrictions, finds an optimal estimator (in the ISE sense) in the convex hull of the density estimators $\hat f_{n,1}, \dots, \hat f_{n,m}$, where $\hat f_{n,1}, \dots, \hat f_{n,m}$ are the $m$ kernel density estimators defined by the $m$ base kernels with bandwidths $h_1, \dots, h_m$. However, without the above mentioned restrictions on the $\alpha$ vector, our model is richer than the convex aggregation of $\hat f_{n,1}, \dots, \hat f_{n,m}$, and is in some sense a "local convex" aggregation of these estimators.
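Concretely, substituting $\alpha_{ij} = \gamma_j$ into Equation (9) collapses CAKE to a single global convex aggregation of the $m$ fixed-bandwidth kernel density estimators:

$$\hat f(x) = \sum_{j=1}^{m} \gamma_j \hat f_{n,j}(x), \qquad \hat f_{n,j}(x) = \frac{1}{n h_j^{d}} \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h_j}\right),$$

which is exactly the aggregation setting of Rigollet and Tsybakov; letting the weights differ across training points is what makes the CAKE model strictly richer.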

²Here the $\alpha$ vector can be seen as having $n$ blocks of size $m$ each, and the convexity constraints in Equation (13) are only over each block of the $\alpha$ vector and not across the blocks.


4 MSE for CAKE-like density estimators and $L_1$ risk of CAKE

We are interested in analyzing the MSE, and its optimal value, of a CAKE-like density estimator at a point $x_0$, where $\mathrm{MSE}(x_0) \stackrel{\mathrm{def}}{=} \mathbb{E}[\hat f(x_0) - f(x_0)]^2$. By CAKE-like density estimators we mean estimators of the type

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\alpha_{ij}}{h_j^{d}}\, k\!\left(\frac{x - x_i}{h_j}\right) \qquad (14)$$

where $\alpha_{ij} \ge 0$, $\sum_{j=1}^{m} \alpha_{ij} = 1$ are fixed constants. We need the following definitions (Nemirovski, 2000) and assumptions.

Definition 1. Let $\beta, L > 0$. The Holder class $\Sigma(\beta, L)$ is defined as the set of all functions $f : [0,1]^d \to \mathbb{R}$ which are $l = \lfloor \beta \rfloor$ times differentiable and whose derivatives satisfy
$$\big|D^{l} f(x)[\underbrace{h, \dots, h}_{l \text{ times}}] - D^{l} f(x')[\underbrace{h, \dots, h}_{l \text{ times}}]\big| \le L |x - x'|^{\beta - l} |h|^{l} \qquad \forall x, x' \in [0,1]^d,\ h \in \mathbb{R}^d,$$
where $\lfloor \beta \rfloor$ is the greatest integer strictly less than $\beta$.

Definition 2. Let $l \ge 1$ be an integer. We say that a kernel $k : \mathbb{R}^d \to \mathbb{R}$ has order $l$ if for all $j_1, \dots, j_d \ge 0$ such that $\sum_{i=1}^{d} j_i \le l$ we have
$$\int_{u \in \mathbb{R}^d} k(u)\, du = 1, \qquad \int_{u \in \mathbb{R}^d} u_1^{j_1} u_2^{j_2} \cdots u_d^{j_d}\, k(u)\, du = 0.$$
If $d = 1$, the above condition becomes $\int_{u \in \mathbb{R}} k(u)\, du = 1$ and $\int_{u \in \mathbb{R}} u^{j} k(u)\, du = 0$ for all $j = 1, \dots, l$.

Assumption 1 (A1). The set $\mathcal{K}$ has smoothing kernels whose bandwidths $h_j$, $j = 1, \dots, m$, satisfy the constraint $\frac{h_{j_1}}{h_{j_2}} = c_{j_1 j_2}$ for all $j_1, j_2 = 1, \dots, m$, where $0 < c_{j_1 j_2} < \infty$, and $h_j \to 0$ as $n \to \infty$ for all $j = 1, \dots, m$.

Assumption 2 (A2). The true density function $f$ belongs to the Holder class $\Sigma(\beta, L)$ and all the base kernels are of order $l = \lfloor \beta \rfloor$. Also $C_1 \stackrel{\mathrm{def}}{=} \int_{\mathbb{R}^d} k^2(\theta)\, d\theta < \infty$ and $C_2 \stackrel{\mathrm{def}}{=} \int_{\mathbb{R}^d} |\theta|^{\beta} k(\theta)\, d\theta < \infty$.

Assumption A1 guarantees that as we see more and more samples the bandwidths all tend to 0 at the same rate. Assumption A2 is satisfied by most commonly used smoothing kernels, such as the Gaussian and Epanechnikov kernels. Our main result is that, given a large enough sample from the distribution, the optimal MSE of a CAKE-like estimator is $O(n^{-\frac{2\beta}{2\beta+d}})$, which is known to be optimal in a minimax sense for the Holder class of densities $\Sigma(\beta, L)$ (Tsybakov, 2009). The proof of Lemma (1) is similar to the proofs of Propositions (1.1, 1.2) in (Tsybakov, 2009).³

³Due to lack of space we have postponed the full proofs to the supplementary material.

Lemma 1. Consider the CAKE-like density estimator shown in Equation (14). Let $b(x_0)$ and $\sigma^2(x_0)$ denote the bias and variance of the estimator. Then under assumptions A1, A2 the estimator $\hat f$ satisfies
$$\sigma^2(x_0) \le \frac{C_1 f_{\max}}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\alpha_{ij}^2}{h_j^{d}}, \qquad |b(x_0)| \le \frac{C_2 L}{n\, l!} \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_{ij} h_j^{\beta}, \qquad \mathrm{MSE}(x_0) \le \alpha^{T} M \alpha,$$
where $f_{\max}$ is the maximum value of the underlying density, $C_3 = \frac{1}{n^2}\left(\frac{C_2 L}{l!}\right)^2$, $C_4 = \frac{C_1 f_{\max}}{n^2}$, $|\cdot|$ is the standard Euclidean norm on $\mathbb{R}^d$, and $M \in \mathbb{R}^{nm \times nm}$ is defined as
$$M[ij, pl] = \begin{cases} C_3 h_j^{2\beta} + \dfrac{C_4}{h_j^{d}} & \text{if } i = p \text{ and } j = l, \\[4pt] C_3 h_j^{\beta} h_l^{\beta} & \text{otherwise.} \end{cases} \qquad (15)$$

Lemma 2. Consider the optimization problem
$$P_1: \quad \min_{\alpha \in \mathbb{R}^{nm \times 1}} \; \alpha^{T} M \alpha \quad \text{subject to } \sum_{j=1}^{m} \alpha_{ij} = 1 \;\; \forall i = 1, \dots, n, \quad \alpha \ge 0.$$
Under assumptions A1, A2, and for $n \ge n_0(f_{\max}, \beta, d, L)$, the optimal value of the objective is $\mathbf{1}_n^{T} (A M^{-1} A^{T})^{-1} \mathbf{1}_n$, where $A \in \mathbb{R}^{n \times nm}$ and the $r$th row of the matrix $A$ is given by $[\underbrace{\mathbf{0}_m, \dots, \mathbf{0}_m}_{r-1 \text{ times}}, \mathbf{1}_m, \underbrace{\mathbf{0}_m, \dots, \mathbf{0}_m}_{n-r \text{ times}}]^{T}$. Also, the optimal value $\mathrm{MSE}(x_0) = O(n^{-\frac{2\beta}{2\beta+d}})$ is attained when $h_j = \Theta(n^{-\frac{1}{2\beta+d}})$.

Proof Sketch. Let $P_4$ be the optimization problem $P_1$ but without the positivity constraints. Lemmas (4-9) in the supplement establish the equivalence of problems $P_1$ and $P_4$ under assumption A1 and for large enough $n$. The solution of problem $P_4$ is derived in Lemma (10) using the Lagrangian. The second part of the proof requires rewriting the upper bound on the MSE as $\mathbf{1}_n^{T} B \mathbf{1}_n$ for an appropriate matrix $B$ (Lemma (11)), followed by a spectral analysis in Lemmas (12-15).

Theorem 3. Under assumptions A1, A2 and $h_j = \Theta(n^{-\frac{1}{2\beta+d}})$, for all $n \ge n_0(f_{\max}, \beta, d, L)$ the CAKE estimator satisfies
$$\sup_{x_0 \in \mathbb{R}^d} \; \sup_{f \in \Sigma(\beta, L)} \; \mathbb{E}_{D_n}\!\left[\big(\hat f(x_0) - f(x_0)\big)^2\right] = O\big(n^{-\frac{2\beta}{2\beta+d}}\big).$$

Proof. It is enough to prove that for all $f \in \Sigma(\beta, L)$ and $x_0 \in \mathbb{R}^d$ we have $f_{\max} < C < \infty$ for some universal $C$; the result then follows from Lemma (2). Choose bounded smoothing kernels with $h_j = 1$. From Lemma (1) we have
$$f(x_0) \le \frac{C_2 L}{l!} + \int K(x_0 - z) f(z)\, dz \le \frac{C_2 L}{l!} + K_{\max} < \infty.$$
Since the R.H.S. is independent of $f$ and $x_0$, one can choose $C$ to be the R.H.S. of the above equation. $\square$
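The rate in Lemma 2 and Theorem 3 is the usual bias-variance balance. As a sanity check, take a common bandwidth $h_j = h$ and any feasible $\alpha$ (so $\sum_{j} \alpha_{ij} = 1$, and hence $\sum_{j} \alpha_{ij}^2 \le 1$); the bounds of Lemma 1 then give
$$|b(x_0)| \le \frac{C_2 L}{l!}\, h^{\beta}, \qquad \sigma^2(x_0) \le \frac{C_1 f_{\max}}{n h^{d}}, \qquad \mathrm{MSE}(x_0) \le \Big(\frac{C_2 L}{l!}\Big)^{2} h^{2\beta} + \frac{C_1 f_{\max}}{n h^{d}},$$
and balancing the two terms by taking $h = \Theta(n^{-\frac{1}{2\beta+d}})$ yields $\mathrm{MSE}(x_0) = O(n^{-\frac{2\beta}{2\beta+d}})$.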


Our next result is regarding the $L_1$ risk of CAKE in terms of its empirical $L_1$ risk.

Theorem 4. Suppose we are given fixed bandwidths $h_1, \dots, h_m$, and the underlying density function $f$ is bounded by a constant $B$. Let $c_d = (\sqrt{2\pi})^{d}$. Then with probability at least $1 - \delta$ over the input training samples, the CAKE estimator $\hat f$ given by the problem (12)-(13) with Gaussian base kernels satisfies the risk bound
$$\mathbb{E}_{x \sim D} |\hat f(x) - f(x)| \le \frac{1}{n} \sum_{i=1}^{n} |\hat f(x_i) - f(x_i)| + \left[\frac{4}{c_d}\left[\sum_{j=1}^{m} \frac{1}{h_j^{d}} + \frac{2\sqrt{2}}{\sqrt{c_d}\,\sqrt{\lambda}} \sqrt{\sum_{j=1}^{m} \frac{1}{h_j^{2d}}}\; \sqrt{\sum_{j=1}^{m} \sum_{l=1}^{m} \frac{1}{\big(\sqrt{h_j^2 + h_l^2}\big)^{d}}}\;\right] + B + \sum_{j=1}^{m} \frac{1}{c_d h_j^{d}}\right] \sqrt{\frac{\ln(\frac{1}{\delta})}{2n}}. \qquad (16)$$

Proof Sketch. The proof proceeds by bounding the uniform stability of CAKE w.r.t. the loss function $|\hat f(x) - f(x)|$ in Lemma (16). Applying Theorem (15) then gives the desired result.

5 Empirical Results

We implemented all the algorithms in C++ as part of the open source machine learning toolbox MLPACK (Gray et al., 2009) and compared the estimators on both 1-d and 2-d, synthetic and natural datasets. Marron and Wand (Marron & Wand, 1992) proposed a set of 15 synthetic distributions as a testbed. These mixtures have varying levels of smoothness and modality and serve as an ideal benchmark for comparing the different density estimators. Due to lack of space we investigate the performance of CAKE on four synthetic datasets sampled from the skewed unimodal density (SUD), outlier density (OD), bimodal density (BD) and trimodal density (TD), and on two well-known natural datasets, namely the Old Faithful geyser dataset and the Suicide dataset (Silverman, 1986; Sain, 1994). Full experimental results can be found in the supplementary material. The Suicide dataset has measurements of the duration of hospitalization of attempted-suicide patients. Two versions of the Old Faithful dataset are available: the first has 107 observations (1-d) measuring the eruption length, while the second has 272 observations (2-d) of both the eruption length and the waiting time between eruptions. We used these for the 1-dimensional and multi-dimensional experiments respectively. For our experiments with 1-d distributions we sampled 1600 points for training and tested the final density estimators on another 800 points sampled from the same distribution.

To learn the CAKE estimator we used a set of 10 base kernels (all Gaussians) with bandwidths in the range $[\frac{h_p}{3}, 3 h_p]$, where $h_p$ is the plugin bandwidth calculated using the equation $h_p = \left(\frac{4}{d+2}\right)^{\frac{1}{d+4}} \sigma\, n^{-\frac{1}{d+4}}$,

where $\sigma$ is the empirical standard deviation. The bandwidth parameter $h$ used in AKDE, VKDE and RSDE, and the regularization parameter $\lambda$ used in CAKE, were all found by cross-validation. The parameters for local RODEO were chosen as $c_n = \log(d)$, $c_0 = $ range of the training data, $\beta = 0.9$; these are the settings used by Liu et al. in their paper (Liu et al., 2007). Sain (Sain, 1994) observes that the eruption length of the Old Faithful dataset has 2 modes of approximately equal height separated by a smooth valley, while the Suicide dataset has a unimodal distribution with a long tail. We report the RMSE of the different density estimators in Table (2), and show plots of the different density estimators in Figure (2). It is clear from the plots that RSDE tends to over-smooth the distributions, and RODEO fails to capture multiple modes in a distribution and gives very rough density estimates in the tails and valleys. CAKE tends to give smoother estimates than AKDE while still capturing all the features of the distribution well. VKDE is generally seen to give noisy density estimates. On the Suicide dataset all the density estimators except VKDE show a unimodal structure with a long tail. However, RODEO shows heavy tails and RSDE flattens out the main mode, while AKDE exaggerates the size of the main mode. The tail behaviour of AKDE and CAKE is better than that of the other estimators. On the Old Faithful dataset both AKDE and CAKE show a bimodal structure, with CAKE capturing the property of equal mode size better than AKDE. RODEO completely smooths out the first mode. RSDE on both of these datasets gives ultra-smooth density estimates that do not show the important features of the distribution. As can be seen from Figure (2), the regularization term helps learn smoother density estimates. The presence of the regularization term is especially important in our problem formulation because, unlike the data-splitting scheme used in aggregation of estimators (Nemirovski, 2000), the same training sample is used both to learn the base kernel bandwidths (by calculating the plug-in bandwidth) and to learn the final CAKE density estimator. For the multidimensional experiment the dataset was whitened for computational purposes. Whitening the dataset is equivalent to working with the original data with the bandwidth of the kernel chosen according to the covariance matrix of the distribution (p. 78 of (Silverman, 1986)).
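A small sketch of the plugin bandwidth and the base-kernel grid used above. The multivariate convention for $\sigma$ and the spacing of the 10 bandwidths inside $[h_p/3, 3h_p]$ are not spelled out in the text, so the averaged standard deviation and the geometric spacing below are assumptions; names are ours.

```python
import numpy as np

def plugin_bandwidth(train):
    # h_p = (4 / (d + 2))^(1/(d+4)) * sigma * n^(-1/(d+4)), with sigma the empirical
    # standard deviation (averaged over dimensions here -- an assumption, since the
    # multivariate convention is not spelled out in the text).
    n, d = train.shape
    sigma = train.std(axis=0, ddof=1).mean()
    return (4.0 / (d + 2.0)) ** (1.0 / (d + 4.0)) * sigma * n ** (-1.0 / (d + 4.0))

def base_bandwidths(train, m=10):
    # Ten Gaussian base kernels with bandwidths spanning [h_p / 3, 3 h_p];
    # the spacing is not specified in the text, so a geometric grid is assumed.
    h_p = plugin_bandwidth(train)
    return np.geomspace(h_p / 3.0, 3.0 * h_p, m)
```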


[Figure 2 plots: density estimates on the Skewed Unimodal, Outlier, Bimodal and Trimodal densities, the Old Faithful (1-d) dataset and the Suicide dataset; the last row compares CAKE with and without regularization.]

Figure 2: Performance of AKDE, CAKE, RSDE, RODEO and VKDE on synthetic and natural datasets. The Old Faithful (1-d) panel models the distribution of eruption lengths of the Old Faithful geyser. The Suicide dataset models the distribution of the (scaled) length of hospitalization of attempted-suicide patients. The Old Faithful dataset can be obtained from www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat; the Suicide dataset can be obtained from (Silverman, 1986). The plots in the last row show the impact of regularization on the smoothness of the CAKE density estimator.


[Figure 3 surface plots: density estimates by AKDE, CAKE, VKDE, RSDE and RODEO over eruption length and waiting time.]

Figure 3: Performance of density estimators on the multi-dimensional Old Faithful dataset.

Dataset      Train/Test size   CAKE    AKDE    RODEO   SVM
Banana       400/4900          86.25   85.4    62.82   88.70
Flare Solar  666/400           65.0    55.0     3.25   66.50
Twonorm      400/7000          93.27   96.51   50      97.04
Heart        169/99            79.0    45.0    32.00   82.0
Titanic      149/2050          74.12   74.11   17.21   74.12
Ringnorm     400/7000          66.10   50.0    47.63   98.50
German       700/300           74.00   18.00   50.00   79.70

Table 1: Comparison of different density estimators when used as classifiers (classification accuracies) on some UCI datasets. These datasets can be obtained from http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm

On the multi-dimensional version of the Old Faithful dataset, CAKE (Figure 3) captures the bimodal nature of the distribution better than the other density estimators. RODEO reflects the bi-modality in the distribution but heavily overestimates the modes.

CAKE as a classifier. The non-parametric kernel classification rule (KCR) (Devroye et al., 1996) learns a binary classifier that labels a point as +1 if the kernel contribution to the density from the positively labeled points is larger than that from the negatively labeled points. In this experiment we use different density estimators in the KCR and compare the accuracies of the resulting classifiers. The goal is to see, via suggestive experiments, how CAKE works in high dimensions when compared to other density estimators. Table (1) suggests good performance of CAKE over the other estimators, even though we did not learn estimators specifically designed for the classification task.
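A sketch of the kernel classification rule, with a single fixed-bandwidth Gaussian kernel standing in for whichever density estimator gets plugged in (in the experiments that estimator is CAKE, AKDE or RODEO); names are ours.

```python
import numpy as np

def gauss(u):
    # Gaussian smoothing kernel on R^d.
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)

def kcr_predict(x, train, labels, h):
    # Kernel classification rule: predict +1 when the total kernel contribution
    # from the positively labelled training points exceeds that from the
    # negatively labelled ones, and -1 otherwise.
    pos, neg = train[labels == 1], train[labels == -1]
    score_pos = gauss((x[:, None, :] - pos[None, :, :]) / h).sum(axis=1)
    score_neg = gauss((x[:, None, :] - neg[None, :, :]) / h).sum(axis=1)
    return np.where(score_pos > score_neg, 1, -1)
```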

6 Conclusions

We proposed a new kernel density estimator called CAKE which fits kernels at different training points by learning different convex aggregations of base kernels at different training points. We analyzed CAKE theoretically and observed empirically that it performs better than most estimators. It would be interesting to see if CAKE with univariate kernels, without the convexity constraints, can be used in non-parametric regression with the group lasso to learn regression functions that are not necessarily globally sparse but are locally sparse. In its present form CAKE requires the bandwidths $h_1, \dots, h_m$. This provides a mechanism for the user to inject domain knowledge. A nice extension of our present framework would involve learning different bandwidths along different dimensions, which can be seen as learning the basis set as in sparse coding. Once the bandwidths of the base kernels are learnt we can then use them in the CAKE framework. On the theoretical side we have provided a risk bound that depends on an unknown quantity, the empirical $L_1$ distance between our estimator and the true density function. If one can give a data-dependent bound for this quantity then one can use Theorem (4) to provide a completely data-dependent bound on the $L_1$ distance between the CAKE estimator and the true density. Extension of our analysis to inhomogeneous density functions such as those in Besov spaces is another fruitful direction.

Density   CAKE    Adaptive   Variable   RODEO   RSDE
SUD       0.096   0.083      0.0827     0.098   0.106
OD        0.80    0.85       0.691      0.886   1.70
BD        0.021   0.023      0.197      0.031   0.079
TD        0.141   0.156      0.167      0.145   0.119

Table 2: RMSE values of different density estimators on various synthetic 1-d distributions.


References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer, New York.

Breiman, L., Meisel, W., & Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.

Devroye, L., & Lugosi, G. (2000). Combinatorial methods in density estimation. Springer-Verlag.

Gasser, T., Müller, H. G., & Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. JRSS B.

Girolami, M., & He, C. (2003). Probability density estimation from optimally condensed data samples. IEEE PAMI.

Gray, A., et al. (2009). MLPACK. http://mloss.org/software/view/152/.

Hall, P., Marron, J., & Titterington, D. (1995). On partial local smoothing rules for curve estimation. Biometrika.

Jones, M., Marron, J., & Sheather, S. (1996). A brief survey of bandwidth selection for density estimation. JASA.

Keerthi, S., et al. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation.

Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. I. (2004). Learning the Kernel Matrix with Semidefinite Programming. JMLR, 5, 27-72.

Liu, H., Lafferty, J., & Wasserman, L. (2007). Sparse nonparametric density estimation in high dimensions using the rodeo. AISTATS.

Marron, J., & Wand, M. (1992). Exact mean integrated squared error. The Annals of Statistics, 712-736.

Nemirovski, A. (2000). Topics in non-parametric statistics. Lectures on probability theory and statistics.

Ong, C. S., Smola, A., & Williamson, R. C. (2005). Learning the Kernel with Hyperkernels. JMLR.

Parzen, E. (1962). On the estimation of a probability density function and mode. Annals of Mathematical Statistics.

Rigollet, P., & Tsybakov, A. (2007). Linear and convex aggregation of density estimators. Mathematical Methods of Statistics.

Sain, S. (1994). Adaptive kernel density estimation. Doctoral dissertation, Rice University.

Shawe-Taylor, J., & Dolia, A. (2007). A Framework for Probability Density Estimation. AISTATS.

Silverman, B. (1986). Density estimation for statistics and data analysis. Chapman & Hall/CRC.

Song, L., Zhang, X., Gretton, A., Schölkopf, B., Smola, A., & Skolnick, J. (2008). Tailoring density estimation via reproducing kernel moment matching. ICML.

Stone, C. J. (1984). An Asymptotically Optimal Window Selection Rule for Kernel Density Estimates. The Annals of Statistics, 12.

Tsybakov, A. (2003). Optimal rates of aggregation. COLT proceedings (p. 303).

Tsybakov, A. (2009). Introduction to nonparametric estimation. Springer Verlag.

Vapnik, V. N., & Mukherjee, S. (1999). Support vector method for multivariate density estimation. Advances in NIPS.

