
Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo

Han Liu (1,2), John Lafferty (2,3), Larry Wasserman (1,2)

(1) Statistics Department, (2) Machine Learning Department, (3) Computer Science Department, Carnegie Mellon University

July 1st, 2006

Liu, Lafferty, Wasserman Sparse Nonparametric Density Estimation


Motivation

Research background

Rodeo is a general strategy for nonparametric inference. It has been successfully applied to sparse nonparametric regression problems in high dimensions by Lafferty & Wasserman (2005).

Our goal

We adapt the rodeo framework to nonparametric density estimation, so that a single framework covers both density estimation and regression problems and is both computationally efficient and theoretically sound.


Outline

1 Background
   Nonparametric density estimation in high dimensions
   Sparsity assumptions for density estimation

2 Methodology and Algorithms
   The main idea
   The local rodeo algorithm for the kernel density estimator

3 Asymptotic Properties
   The asymptotic running time and minimax risk

4 Extension and Variations
   The global density rodeo and the reverse density rodeo
   Using other distributions as irrelevant dimensions

5 Experimental Results
   Empirical results on both synthetic and real-world datasets



Problem statement

Problem: estimate the joint density of a continuous d-dimensional random vector

$$X = (X_1, X_2, \ldots, X_d) \sim F, \quad d \ge 3,$$

where F is the unknown distribution with density function f(x).

This problem is inherently hard, since high dimensionality causes both computational and theoretical problems.






Previous work

From a frequentist perspective

- Kernel density estimation and the local likelihood method
- Projection pursuit method
- Log-spline models and the penalized likelihood method

From a Bayesian perspective

Mixture of normals with Dirichlet processes as prior

Difficulties of current approaches

- Some methods only work well for low-dimensional problems
- Some heuristics lack theoretical guarantees
- More importantly, they suffer from the curse of dimensionality






The curse of dimensionality

Characterizing the curse

In a Sobolev space of order k, minimax theory shows that the best convergence rate for the mean squared error is

$$R_{\mathrm{opt}} = O\left(n^{-2k/(2k+d)}\right),$$

which is slow in practice when the dimension d is large.
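To make the slowdown concrete, the following minimal sketch evaluates the rate $n^{-2k/(2k+d)}$ for a few dimensions. It ignores the constants hidden in the O(·) notation, and the function names and the chosen values of n, k and d are illustrative assumptions, not part of the talk.

```python
def minimax_rate(n: float, k: int, d: int) -> float:
    """Return n^(-2k/(2k+d)), the minimax MSE rate with constants ignored."""
    return n ** (-2.0 * k / (2.0 * k + d))


def sample_size_for(target: float, k: int, d: int) -> float:
    """Return the n at which n^(-2k/(2k+d)) equals `target` (constants ignored)."""
    return target ** (-(2.0 * k + d) / (2.0 * k))


if __name__ == "__main__":
    # Smoothness k = 2; compare several dimensions at a fixed n and a fixed target.
    for d in (1, 5, 10, 30):
        rate = minimax_rate(1e4, 2, d)
        needed = sample_size_for(0.01, 2, d)
        print(f"d={d:2d}  rate at n=1e4: {rate:.4f}  n for rate 0.01: {needed:.3g}")
```

The second function simply inverts the rate, which shows how quickly the required sample size grows with d under these simplifying assumptions.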

Combating the curse by some sparsity assumptions

If the high-dimensional data have a low-dimensional structure or satisfy a sparsity condition, we expect that methods can be developed to combat the curse of dimensionality. This motivates the development of the rodeo framework.






Rodeo for nonparametric regression (I)

Rodeo (regularization of derivative expectation operator) is a general strategy for nonparametric inference, which has been used for nonparametric regression.

For a regression problem

$$Y_i = m(X_i) + \epsilon_i, \quad i = 1, \ldots, n,$$

where $X_i = (X_{i1}, \ldots, X_{id}) \in \mathbb{R}^d$ is a d-dimensional covariate. If m is in a d-dimensional Sobolev space of order 2, the best convergence rate for the risk is

$$R^* = O\left(n^{-4/(4+d)}\right),$$

which shows the curse of dimensionality in a regression setting.






Rodeo for nonparametric regression (II)

Assume the true function depends on only r covariates ($r \ll d$):

$$m(x) = m(x_1, \ldots, x_r).$$

Then, for any $\epsilon > 0$, the rodeo can simultaneously perform bandwidth selection and (implicitly) variable selection to achieve a better minimax convergence rate of

$$R_{\mathrm{rodeo}} = O\left(n^{-4/(4+r)+\epsilon}\right),$$

as if the r relevant variables were explicitly isolated in advance.

The rodeo beats the curse of dimensionality in this sense. We expect to apply the same idea to density estimation problems.
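As a rough worked illustration, with constants and the $\epsilon$ term ignored, take $n = 10^4$, $d = 30$ and $r = 5$ (values chosen here only for illustration, not results from the talk). The full-dimensional rate and the sparse rate then compare as

$$n^{-4/(4+d)} = 10^{-16/34} \approx 0.34, \qquad n^{-4/(4+r)} = 10^{-16/9} \approx 0.017.$$

Under these assumptions, exploiting the sparsity improves the rate by roughly a factor of 20 at this sample size.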






Sparse density estimation

For many applications, the true density function can be characterized by some low-dimensional structure.

Sparsity assumption for density estimation problems

Let $h_{jj}(x)$ denote the second partial derivative of h with respect to the j-th variable. Assume there exists some $r \ll d$ such that

$$f(x) \propto g(x_1, \ldots, x_r)\, h(x), \quad \text{where } h_{jj}(x) = 0 \text{ for } j = 1, \ldots, d.$$

Here $x_R = \{x_1, \ldots, x_r\}$ are the relevant dimensions. This condition requires h(·) to belong to a family of very smooth functions (e.g. the uniform distribution).

h(·) can be generalized to be any parametric distribution!






Generalized sparse density estimation

We can generalize h(·) to other distributions (e.g. Gaussian).

General sparsity assumption for density estimation problems

Assume h(·) is any distribution (e.g. Gaussian) that we are not interested in, and

$$f(x) \propto g(x_1, \ldots, x_r)\, h(x), \quad \text{where } r \ll d.$$

Thus, the density function f(·) can be factored into two parts: the relevant components g(·) and the irrelevant components h(·), where $x_R = \{x_1, \ldots, x_r\}$ are the relevant dimensions.

Under this framework, we can hope to achieve a better minimax rate

$$R^*_{\mathrm{rodeo}} = O\left(n^{-4/(4+r)}\right).$$





Related work

Recent work that addressed this problem

- Minimum volume sets (Scott & Nowak, JMLR 2006)
- Non-Gaussian component analysis (Blanchard et al., JMLR 2006)
- Log-ANOVA model (Lin & Jeon, Statistica Sinica 2006)

Advantages of our approach:

- Rodeo can utilize well-established nonparametric estimators
- A unified framework for different kinds of problems
- Easy to implement and amenable to theoretical analysis






Density rodeo: the main idea

The key intuition: if a dimension is irrelevant, then changing the smoothing parameter of that dimension should result in only a small change in the whole estimator.

Basically, the rodeo is just a regularization strategy:
- Use a kernel density estimator, starting with large bandwidths
- Calculate the gradient of the estimator w.r.t. the bandwidths
- Sequentially decrease the bandwidths in a greedy way, and freeze this decay process by a thresholding strategy to achieve a sparse solution







Fix a point x and let $\hat f_H(x)$ denote an estimator of f(x) based on the smoothing parameter matrix $H = \mathrm{diag}(h_1, \ldots, h_d)$. Let $M(h) = E(\hat f_h(x))$ denote the mean of $\hat f_h(x)$, so that $f(x) = M(0) = E(\hat f_0(x))$. Let $P = \{h(t) : 0 \le t \le 1\}$ be a smooth path through the set of smoothing parameters with h(0) = 0 and h(1) = 1. Then

$$f(x) = M(1) - (M(1) - M(0)) = M(1) - \int_0^1 \frac{dM(h(s))}{ds}\, ds = M(1) - \int_0^1 \langle D(s), \dot h(s) \rangle\, ds,$$

where $D(h) = \nabla M(h) = \left(\frac{\partial M}{\partial h_1}, \ldots, \frac{\partial M}{\partial h_d}\right)^T$ is the gradient of M(h) and $\dot h(s) = \frac{dh(s)}{ds}$ is the derivative of h(s) along the path.







A biased, low-variance estimator of M(1) is $\hat f_1(x)$. An unbiased estimator of D(h) is

$$Z(h) = \left(\frac{\partial \hat f_H(x)}{\partial h_1}, \ldots, \frac{\partial \hat f_H(x)}{\partial h_d}\right)^T.$$

This naive estimator has poor risk due to the large variance of Z(h) for small bandwidths. However, the sparsity assumption on f suggests that there should be paths for which D(h) is also sparse. Along such a path, Z(h) could be replaced with an estimator $\hat D(h)$ that makes use of the sparsity assumption. Then, the estimate of f(x) becomes

$$\tilde f_H(x) = \hat f_1(x) - \int_0^1 \langle \hat D(s), \dot h(s) \rangle\, ds.$$






Kernel density estimator

Let $\hat f_H(x)$ denote the kernel density estimator of f(x) with bandwidth matrix H. Assume that K is a standard symmetric kernel, such that

$$\int K(u)\, du = 1, \qquad \int u K(u)\, du = 0_d,$$

while $K_H(\cdot) = \frac{1}{\det(H)} K(H^{-1}\cdot)$ denotes the kernel with bandwidth matrix $H = \mathrm{diag}(h_1, \ldots, h_d)$. Then

$$\hat f_H(x) = \frac{1}{n \det(H)} \sum_{i=1}^n K\left(H^{-1}(x - X_i)\right) = \frac{1}{n} \sum_{i=1}^n \prod_{j=1}^d \frac{1}{h_j} K\left(\frac{x_j - X_{ij}}{h_j}\right).$$

Here, we assume that K is a product kernel.
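The product-kernel form of the estimator translates almost directly into code. The sketch below assumes a Gaussian kernel for K (the slides only require a standard symmetric kernel), and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, used here as the univariate kernel K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def product_kde(x, X, h):
    """Evaluate f_H(x) = (1/n) sum_i prod_j (1/h_j) K((x_j - X_ij) / h_j).

    x: query point, shape (d,); X: data, shape (n, d); h: bandwidths, shape (d,).
    """
    x, X, h = (np.asarray(a, dtype=float) for a in (x, X, h))
    u = (x - X) / h                                   # scaled differences, shape (n, d)
    per_point = np.prod(gaussian_kernel(u) / h, axis=1)
    return float(per_point.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 2))
    print(product_kde([0.0, 0.0], data, [0.4, 0.4]))
```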






The rodeo statistics for the kernel density estimator

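The density estimation rodeo is based on the statistics

$$Z_j = \frac{\partial \hat f_H(x)}{\partial h_j} = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial h_j} \prod_{k=1}^d \frac{1}{h_k} K\left(\frac{x_k - X_{ik}}{h_k}\right) \equiv \frac{1}{n} \sum_{i=1}^n Z_{ji}.$$

For the conditional variance term,

$$s_j^2 = \mathrm{Var}(Z_j \mid X_1, \ldots, X_n) = \mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^n Z_{ji} \,\Big|\, X_1, \ldots, X_n\right) = \frac{1}{n} \mathrm{Var}(Z_{j1} \mid X_1, \ldots, X_n).$$

Here, we use the sample variance of the $Z_{ji}$ to estimate $\mathrm{Var}(Z_{j1})$.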
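For a concrete kernel the statistics can be written in closed form. The sketch below assumes a Gaussian product kernel, for which differentiating $(1/h_j) K(u)$ with $u = (x_j - X_{ij})/h_j$ gives $(1/h_j) K(u) (u^2 - 1)/h_j$; the function names are hypothetical and other kernels would need their own derivative.

```python
import numpy as np


def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def kde(x, X, h):
    """Product-kernel density estimate at x (Gaussian kernel assumed)."""
    u = (np.asarray(x, float) - np.asarray(X, float)) / np.asarray(h, float)
    return float(np.prod(gaussian_kernel(u) / h, axis=1).mean())


def rodeo_statistics(x, X, h):
    """Return (Z, s): bandwidth derivatives Z_j and their estimated std. devs. s_j."""
    x, X, h = (np.asarray(a, dtype=float) for a in (x, X, h))
    n = X.shape[0]
    u = (x - X) / h                                    # shape (n, d)
    weights = np.prod(gaussian_kernel(u) / h, axis=1)  # product kernel per data point
    # Z_ji = [(u_ij^2 - 1) / h_j] * prod_k (1/h_k) K(u_ik), Gaussian kernel only
    z_terms = ((u ** 2 - 1.0) / h) * weights[:, None]
    Z = z_terms.mean(axis=0)
    # s_j^2 = Var(Z_j1) / n, with Var(Z_j1) estimated by the sample variance
    s = np.sqrt(z_terms.var(axis=0, ddof=1) / n)
    return Z, s
```

A finite-difference check of `kde` with respect to each bandwidth is a simple way to validate the analytic derivative.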






Density rodeo algorithms

Density Rodeo: Hard thresholding version

1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0 = c_0 / \log\log n$ for some constant $c_0$. Let $c_n$ be a sequence such that $c_n = O(1)$.
2. Initialize the bandwidths, and activate all dimensions:
   (a) $h_j = h_0$, j = 1, 2, ..., d.
   (b) $A = \{1, 2, \ldots, d\}$.
3. While A is nonempty, do for each $j \in A$:
   (a) Compute the derivative and variance: $Z_j$ and $s_j$.
   (b) Compute the threshold $\lambda_j = s_j \sqrt{2 \log(n c_n)}$.
   (c) If $|Z_j| > \lambda_j$, then set $h_j \leftarrow \beta h_j$; otherwise remove j from A.
4. Output the bandwidths $h^*$ and the estimator $\hat f(x) = \hat f_{H^*}(x)$.
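The steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a Gaussian product kernel, treats $c_0$ and $c_n$ as plain constants, and adds a `max_iter` safety cap that is not part of the algorithm as stated.

```python
import numpy as np


def _phi(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def local_density_rodeo(x, X, beta=0.9, c0=1.0, cn=1.0, max_iter=200):
    """Hard-thresholding density rodeo at a single point x.

    Returns (h, f_hat): the selected bandwidths and the density estimate at x.
    Requires n large enough that log(log(n)) > 0.
    """
    x = np.asarray(x, dtype=float)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    h = np.full(d, c0 / np.log(np.log(n)))         # steps 1-2: initial bandwidths
    active = list(range(d))                        # step 2(b): all dimensions active
    factor = np.sqrt(2.0 * np.log(n * cn))
    for _ in range(max_iter):                      # step 3 (capped for safety)
        if not active:
            break
        for j in list(active):
            u = (x - X) / h
            weights = np.prod(_phi(u) / h, axis=1)
            z_terms = (u[:, j] ** 2 - 1.0) / h[j] * weights   # Z_ji, Gaussian kernel
            z_j = z_terms.mean()
            s_j = np.sqrt(z_terms.var(ddof=1) / n)
            if abs(z_j) > s_j * factor:            # step 3(c): significant derivative
                h[j] *= beta
            else:
                active.remove(j)                   # freeze this bandwidth
    f_hat = float(np.prod(_phi((x - X) / h) / h, axis=1).mean())   # step 4
    return h, f_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1000
    data = np.column_stack([rng.normal(0.5, 0.05, n), rng.uniform(0, 1, n)])
    print(local_density_rodeo(np.array([0.5, 0.5]), data))
```

In the demo, the first coordinate is sharply peaked and the second is uniform, so the first bandwidth should decay further than the second.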





Asymptotic Properties

The purpose of the analysis

The analysis characterizes the asymptotic behavior of

- the selected bandwidths
- the asymptotic running time (efficiency)
- the convergence rate of the risk (accuracy)

To make the theoretical results more realistic, a key aspect of our analysis is that we allow the dimension d to increase with the sample size n. For this, we need to make some assumptions about the unknown density function to take the increasing dimension into account.






Assumptions

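(A1) Kernel assumption:

$$\int u u^T K(u)\, du = v_2 I_d \text{ with } v_2 < \infty, \quad \text{and} \quad \int K^2(u)\, du = R(K) < \infty$$

(A2) Dimension assumption:

$$d \log d = O(\log n)$$

(A3) Initial bandwidth assumption:

$$h_j^{(0)} = c_0 / \log\log n \quad \text{for } j = 1, \ldots, d$$

Combined with A2, this implies that $\lim_{n \to \infty} n \prod_{j=1}^d h_j^{(0)} = \infty$.

(A4) Sparsity assumption:

$$f(x) \propto g(x_1, \ldots, x_r)\, h(x), \quad \text{where } h_{jj}(x) = 0 \text{ and } r = O(1)$$

(A5) Hessian assumption:

$$\int \mathrm{tr}\left(H_R^T(u) H_R(u)\right) du < \infty, \quad \text{and} \quad f_{jj}(x) \ne 0 \text{ for } j = 1, 2, \ldots, r,$$

where $H_R(u)$ denotes the Hessian for the relevant dimensions.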






Derivatives of both relevant and irrelevant dimensions

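Key Lemma: Under assumptions A1–A5, suppose that x is interior to the support of f, and that K is a product kernel with bandwidth matrix $H^{(s)} = \mathrm{diag}(h_1^{(s)}, \ldots, h_d^{(s)})$. Then

$$\mu_j^{(s)} = \frac{\partial}{\partial h_j^{(s)}} E\left[\hat f_{H^{(s)}}(x) - f(x)\right] = o_P\left(h_j^{(s)}\right) \quad \text{for all } j \in R^c.$$

For $j \in R$ we have

$$\mu_j^{(s)} = \frac{\partial}{\partial h_j^{(s)}} E\left[\hat f_{H^{(s)}}(x) - f(x)\right] = h_j^{(s)} v_2 f_{jj}(x) + o_P\left(h_j^{(s)}\right).$$

Thus, for any integer s > 0 with $h^{(s)} = h^{(0)} \beta^s$, each j > r satisfies

$$\mu_j^{(s)} = o_P\left(h_j^{(s)}\right) = o_P\left(h_j^{(0)}\right).$$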






Main theorem

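Main Theorem: Suppose that r = O(1) and (A1)–(A5) hold. In addition, suppose that $A_{\min} = \min_{j \le r} |f_{jj}(x)| = \Omega(1)$ and $A_{\max} = \max_{j \le r} |f_{jj}(x)| = O(1)$. Then, for every $\epsilon > 0$, the number of iterations $T_n$ until the rodeo stops satisfies

$$P\left(\frac{1}{4+r} \log_{1/\beta}\left(n^{1-\epsilon} a_n\right) \le T_n \le \frac{1}{4+r} \log_{1/\beta}\left(n^{1+\epsilon} b_n\right)\right) \longrightarrow 1,$$

where $a_n = \Omega(1)$ and $b_n = O(1)$. Moreover, the algorithm outputs bandwidths $h^*$ that satisfy

$$P\left(h_j^* = h_j^{(0)} \text{ for all } j > r\right) \longrightarrow 1.$$

Also, we have

$$P\left(h_j^{(0)} (n b_n)^{-1/(4+r)-\epsilon} \le h_j^* \le h_j^{(0)} (n a_n)^{-1/(4+r)+\epsilon} \text{ for all } j \le r\right) \longrightarrow 1,$$

assuming that $h_j^{(0)}$ is defined as in A3.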






Convergence rate of the risk

Theorem 2: Under the same conditions as the main theorem, the risk $R_{h^*}$ of the density rodeo estimator satisfies

$$R_{h^*} = \tilde O_P\left(n^{-4/(4+r)+\epsilon}\right)$$

for every $\epsilon > 0$.

We write $Y_n = \tilde O_P(a_n)$ to mean that $Y_n = O(b_n a_n)$ where $b_n$ is logarithmic in n. As noted earlier, we write $a_n = \Omega(b_n)$ if $\liminf_n \left|\frac{a_n}{b_n}\right| > 0$; similarly $a_n = \tilde\Omega(b_n)$ if $a_n = \Omega(b_n c_n)$ where $c_n$ is logarithmic in n.






Possible extensions

The density rodeo algorithm could be extended in many ways:

- the soft-thresholding version
- the global version
- the reverse version
- the bootstrapping version
- the rodeo algorithm for the local linear density estimator
- the greedier version
- using other distributions as irrelevant dimensions






Global density rodeo

The idea is to average the test statistics over multiple evaluation points $x_1, \ldots, x_m$ sampled from the empirical distribution.

To avoid the cancellation problem, the statistic is squared. Let $Z_j(x_i)$ denote the derivative at the i-th evaluation point with respect to the bandwidth $h_j$. We define the test statistic

$$T_j = \frac{1}{m} \sum_{k=1}^m Z_j^2(x_k), \quad j = 1, \ldots, d,$$

with

$$s_j = \sqrt{\mathrm{Var}(T_j)} = \frac{1}{m} \sqrt{2\, \mathrm{tr}(C_j^2) + 4 \mu_j^T C_j \mu_j},$$

where $\mu_j = \frac{1}{m} \sum_{i=1}^m Z_j(x_i)$. The threshold is

$$\lambda_j = s_j^2 + 2 s_j \sqrt{\log(n c_n)}.$$






The bootstrapping version

Instead of deriving the expression explicitly, bootstrapping can be used to evaluate the variance of $Z_j$.

Bootstrapping method to calculate $s_j^2$

1. For b = 1, ..., B:
   (a) Draw a sample $X_1^*, \ldots, X_n^*$ of size n, with replacement.
   (b) Compute the estimate $Z_{jb}^*$ from the data $X_1^*, \ldots, X_n^*$.
2. Compute the bootstrapped variance $s_j^2 = \frac{1}{B} \sum_{b=1}^B \left(Z_{jb}^* - \bar Z_{j\cdot}\right)^2$, where $\bar Z_{j\cdot} = \frac{1}{B} \sum_{b=1}^B Z_{jb}^*$.
3. Output the resulting $s_j^2$.
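A minimal sketch of this bootstrap procedure is given below. It assumes a Gaussian product kernel when computing $Z_j$, and the function names and the default B = 200 are illustrative choices rather than values from the talk.

```python
import numpy as np


def _phi(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def z_contributions(x, X, h, j):
    """Per-observation terms Z_ji of the bandwidth derivative (Gaussian kernel)."""
    u = (x - X) / h
    weights = np.prod(_phi(u) / h, axis=1)
    return (u[:, j] ** 2 - 1.0) / h[j] * weights


def bootstrap_variance(x, X, h, j, B=200, seed=None):
    """Bootstrap estimate of Var(Z_j) from B resamples of the rows of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # draw n rows with replacement
        z_star[b] = z_contributions(x, X[idx], h, j).mean()
    return float(np.mean((z_star - z_star.mean()) ** 2))
```

For moderate n and B, the result should be close to the plug-in value, i.e. the sample variance of the $Z_{ji}$ divided by n.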






The reverse and greedier version

Reverse density rodeo: instead of using a sequence of decreasing bandwidths, we could begin from a very small bandwidth and use a sequence of increasing bandwidths to estimate the optimal value.

Greedier density rodeo: instead of decaying all the bandwidths, only the bandwidth associated with the largest $Z_j / \lambda_j$ quantity is decayed.

Hybrid density rodeo: different variations could be combined arbitrarily.






Using other distributions as irrelevant dimensions

We can use a general parametric distribution h(x) for the irrelevant dimensions. The key trick is to use a new semiparametric density estimator

$$\hat f_H(x) = \frac{\hat h(x) \sum_{i=1}^n K_H(X_i - x)}{n \int K_H(u - x)\, \hat h(u)\, du},$$

where $\hat h(x)$ is a parametric density estimate at the point x. The motivation for this estimator comes from the local likelihood method (Loader 1996).

In the one-dimensional case, starting from a large bandwidth, if the true density is h(x), the algorithm will tend to freeze the bandwidth decay process immediately.
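As a one-dimensional illustration of this estimator, the sketch below assumes a Gaussian kernel and a Gaussian parametric fit $\hat h$ obtained from the sample mean and variance. Under those assumptions the correction integral has the closed form $N(x; \hat\mu, \hat\sigma^2 + h^2)$; the function names are hypothetical.

```python
import numpy as np


def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def semiparametric_density_1d(x, X, h):
    """Semiparametric estimate at x: parametric fit times KDE over its smoothed fit."""
    X = np.asarray(X, dtype=float)
    mu_hat, var_hat = X.mean(), X.var()              # Gaussian fit (assumed family)
    parametric = normal_pdf(x, mu_hat, var_hat)      # h_hat(x)
    kernel_average = np.mean(normal_pdf(x, X, h ** 2))   # (1/n) sum_i K_h(X_i - x)
    # integral of K_h(u - x) h_hat(u) du for Gaussian K and Gaussian h_hat
    correction = normal_pdf(x, mu_hat, var_hat + h ** 2)
    return float(parametric * kernel_average / correction)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(0.0, 1.0, 1000)
    for bandwidth in (100.0, 1.0, 0.3):
        print(bandwidth, semiparametric_density_1d(0.0, sample, bandwidth))
```

With a very large bandwidth the ratio of the kernel average to the correction term is close to one, so the estimate stays close to the parametric fit, which is consistent with the bandwidth decay freezing early when the data really follow h.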






Experimental design

The density rodeo algorithms were applied to both synthetic and real data, including one-dimensional, two-dimensional, high-dimensional and very high-dimensional examples, to measure the distance between the estimated density function and the true density function. The Hellinger distance is used:

$$D(\hat f \,\|\, f) = \int \left(\sqrt{\hat f(x)} - \sqrt{f(x)}\right)^2 dx = 2 - 2 \int f(x) \sqrt{\frac{\hat f(x)}{f(x)}}\, dx.$$

Assuming that we have altogether m evaluation points $X_1, \ldots, X_m$ drawn from f, the Hellinger distance can be calculated numerically by Monte Carlo integration:

$$D(\hat f \,\|\, f) \approx 2 - \frac{2}{m} \sum_{i=1}^m \sqrt{\frac{\hat f_H(X_i)}{f(X_i)}}.$$
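The Monte Carlo approximation is short enough to sketch directly. The function below assumes the evaluation points are drawn from the true density f and that both densities are available as vectorized callables; the names are hypothetical.

```python
import numpy as np


def hellinger_mc(f_hat, f_true, samples):
    """Monte Carlo Hellinger distance, 2 - (2/m) * sum_i sqrt(f_hat(X_i) / f(X_i)).

    `samples` must be drawn from f_true for the approximation to be valid.
    """
    samples = np.asarray(samples, dtype=float)
    ratio = f_hat(samples) / f_true(samples)
    return float(2.0 - 2.0 * np.mean(np.sqrt(ratio)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = rng.normal(0.0, 1.0, 100000)
    standard = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    shifted = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2.0 * np.pi)
    # For two unit-variance normals one unit apart, the exact value is 2 - 2*exp(-1/8).
    print(hellinger_mc(shifted, standard, xs), 2.0 - 2.0 * np.exp(-1.0 / 8.0))
```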






One-dimensional example: strongly skewed density

Strongly skewed density: we simulated 200 samples from the skewed distribution. The boxplot of the Hellinger distance is produced based on 100 simulations.

This density is chosen to resemble a lognormal distribution. It is distributed as

$$X \sim \sum_{i=0}^{7} \frac{1}{8}\, N\left(3\left(\left(\frac{2}{3}\right)^i - 1\right), \left(\frac{2}{3}\right)^{2i}\right).$$
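A short sketch of how this mixture can be sampled and evaluated is given below. It reads the second argument of N(·, ·) as a variance, so component i has standard deviation $(2/3)^i$; the function names are hypothetical.

```python
import numpy as np


def sample_strongly_skewed(n, seed=None):
    """Draw n points from the 8-component normal mixture with equal weights 1/8."""
    rng = np.random.default_rng(seed)
    component = rng.integers(0, 8, size=n)       # pick i in {0, ..., 7} uniformly
    scale = (2.0 / 3.0) ** component             # standard deviation (2/3)^i
    mean = 3.0 * (scale - 1.0)                   # mean 3((2/3)^i - 1)
    return rng.normal(mean, scale)


def strongly_skewed_pdf(x):
    """Evaluate the mixture density at the points x."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for i in range(8):
        scale = (2.0 / 3.0) ** i
        mean = 3.0 * (scale - 1.0)
        total += np.exp(-0.5 * ((x - mean) / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))
    return total / 8.0


if __name__ == "__main__":
    data = sample_strongly_skewed(200, seed=0)   # same sample size as in the slides
    print(data.mean(), strongly_skewed_pdf(np.array([-2.5, 0.0])))
```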

The true density is plotted below.

[Figure: "Strongly skewed"; x-axis: x, y-axis: density.]






Strongly skewed density: result

[Figure: four panels against x: "Unbiased Cross Validation" (bandwidth = 0.0612; true density vs. unbiased C.V.), "Local kernel density Rodeo" (true density vs. local rodeo), "Global kernel density Rodeo" (true density vs. global rodeo), and "Bandwidth estimated" (bandwidth against x); plus a boxplot comparing Unbiased CV, Local Rodeo and Global Rodeo.]






Two-dimensional example: Combined Beta and Uniform

Combined Beta distribution with a uniform distribution as the irrelevant dimension. We simulate a 2-dimensional dataset with n = 500 points.

The two dimensions are generated independently as

$$X_1 \sim \frac{2}{3}\,\mathrm{Beta}(1, 2) + \frac{1}{3}\,\mathrm{Beta}(10, 10)$$

$$X_2 \sim \mathrm{Uniform}(0, 1)$$
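For reference, data of this form can be simulated as in the sketch below, which reads the expression for $X_1$ as a two-component mixture with weights 2/3 and 1/3. The function name and seed handling are assumptions for illustration.

```python
import numpy as np


def sample_beta_uniform(n=500, seed=None):
    """Return an (n, 2) array: a Beta mixture (relevant) and a Uniform(0, 1) column."""
    rng = np.random.default_rng(seed)
    from_first = rng.random(n) < 2.0 / 3.0           # mixture component indicator
    x1 = np.where(from_first,
                  rng.beta(1.0, 2.0, size=n),        # weight 2/3
                  rng.beta(10.0, 10.0, size=n))      # weight 1/3
    x2 = rng.random(n)                               # irrelevant dimension
    return np.column_stack([x1, x2])


if __name__ == "__main__":
    data = sample_beta_uniform(500, seed=0)
    print(data.shape, data.mean(axis=0))
```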

The true density for the relevant dimension is plotted below.

[Figure: "true relevant dimension"; y-axis: density.]





Combined Beta and Uniform: result

The first plot is the rodeo result; the second plot is the result fitted by the built-in function kde2d (MASS package in R).

[Figure: two perspective plots of Density over the relevant dimension and the irrelevant dimension.]






Combined Beta and Uniform: marginal distribution

[Figure: six panels plotting density: "True relevant density", "True irrelevant dimension", "relevant by Rodeo", "irrelevant by Rodeo", "relevant by KDE2d", "irrelevant by KDE2d".]

Numerically integrated marginal distributions based on the perspective plots of the two estimators (not normalized).






Two-dimensional example: geyser data

Geyser data: a version of the eruptions data from the “Old Faithful” geyser in Yellowstone National Park, Wyoming (Azzalini and Bowman 1990), consisting of continuous measurements from August 1 to August 15, 1985.

There are altogether n = 299 samples and two variables.

“Duration” = the numeric eruption time in minutes.

“waiting” = the waiting time to next eruption.






Geyser data: result

The first plot is the rodeo result; the second plot is the result fitted by the built-in function kde2d (MASS package in R).

[Figure: two perspective plots of Density over the first dimension and the second dimension.]






Geyser data: contour plot

The first plot is the contour plot fitted by the built-in function kde2d (MASS package in R); the second one is fitted by the rodeo algorithm.

[Figure: two contour plots; x-axis: first dimension, y-axis: second dimension, both scaled to [0, 1].]






High dimensional example

30-dimensional example: we generate a 30-dimensional synthetic dataset with r = 5 relevant dimensions (n = 100, with 30 trials). The relevant dimensions are generated as

$$X_i \sim N\left(0.5, (0.02 i)^2\right), \quad \text{for } i = 1, \ldots, 5,$$

while the irrelevant dimensions are generated as

$$X_i \sim \mathrm{Uniform}(0, 1), \quad \text{for } i = 6, \ldots, 30.$$

The test point is $x = \left(\frac{1}{2}, \ldots, \frac{1}{2}\right)$.
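A minimal sketch of this data-generating setup is shown below. The function name and its defaults are assumptions chosen to match the description; the normal scale passed to NumPy is the standard deviation 0.02 i.

```python
import numpy as np


def sample_sparse_dataset(n=100, d=30, r=5, seed=None):
    """Return an (n, d) array: r narrow Gaussian columns, the rest Uniform(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))                           # irrelevant: Uniform(0, 1)
    sd = 0.02 * np.arange(1, r + 1)                  # standard deviations 0.02 * i
    X[:, :r] = rng.normal(0.5, sd, size=(n, r))      # relevant: N(0.5, (0.02 i)^2)
    return X


if __name__ == "__main__":
    data = sample_sparse_dataset(seed=0)
    test_point = np.full(30, 0.5)                    # x = (1/2, ..., 1/2)
    print(data.shape, test_point.shape, data[:, :5].std(axis=0))
```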






30-dimensional example: result

The rodeo path for the 30-dimensional synthetic dataset and the boxplot of the selected bandwidths for 30 trials.

[Figure: left, Bandwidth against Rodeo Step, with one path per dimension 1 to 30; right, boxplot of the selected bandwidths for dimensions X1 to X30.]






Very high dimensional example: image processing

The algorithm was run on 2200 grayscale images of 1s and 2s, each with 256 = 16 × 16 pixels and some unknown background noise; thus this is a 256-dimensional density estimation problem. A test point and the output bandwidth plot are shown here.

[Figure: two image panels, the test point and the output bandwidths.]






Image processing example: evolution plot

The output bandwidth plots are sampled at rodeo steps 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100. They visualize the evolution of the bandwidths, which can be viewed as a dynamic process for feature selection: the earlier a dimension's bandwidth decays, the more informative it is.

[Figure: ten image panels of the bandwidths, one per sampled rodeo step.]






An example using Gaussian as irrelevant dimensions

Using Gaussian as irrelevant dimensions: we apply the semiparametric rodeo algorithm to both 15-dimensional and 20-dimensional synthetic datasets with r = 5 relevant dimensions (n = 1000). When using Gaussian distributions as irrelevant dimensions, the relevant dimensions are generated as

$$X_i \sim \mathrm{Uniform}(0, 1), \quad \text{for } i = 1, \ldots, 5,$$

while the irrelevant dimensions are generated as

$$X_i \sim N\left(0.5, (0.05 i)^2\right), \quad \text{for } i = 6, \ldots, d.$$

The test point is $x = \left(\frac{1}{2}, \ldots, \frac{1}{2}\right)$.






Using Gaussian as irrelevant: result

The rodeo path for the 15-dimensional synthetic data (left) and for the 20-dimensional data (right).

[Figure: two panels of Bandwidth against Rodeo Step, with one path per dimension (1 to 15 on the left, 1 to 20 on the right).]






Using Gaussian as irrelevant: one-dimensional case (I)

[Figure: four panels: "True density", "fitted by Rodeo", "Bandwidth estimated" (bandwidth against x), and "By unbiased CV" (N = 1000, bandwidth = 0.02894).]

1000 one-dimensional data points with $x_i \sim \mathrm{Uniform}(0, 1)$.






Using Gaussian as irrelevant: one-dimensional case (II)

[Figure: four panels: "Gaussian" (true density), "fitted by Rodeo", "Bandwidth estimated" (bandwidth against x), and "correction factor".]

1000 one-dimensional data points with $x_i \sim N(0, 1)$.






Summary

- This work adapts the general rodeo framework to solve density estimation problems
- The sparsity assumption is crucial to guarantee the success of the density rodeo algorithm
- The density rodeo algorithm is efficient on high-dimensional problems, both theoretically and empirically
- The rodeo framework can utilize currently available density estimators, and the implementation is simple
- Future work: develop rodeo algorithms for other kinds of problems, e.g. classification

