Appendix of “Nonparametric Conditional Density Estimation in a
High-Dimensional Regression Setting”
Rafael Izbicki∗ and Ann B. Lee†
A.1 Data Visualization via Spectral Series
As a by-product of the eigenvector approach, the series estimator provides a useful tool for visualizing and organizing complex non-standard data. Figure 1 shows a two-dimensional embedding of the luminous red galaxy data (Section 4, main manuscript) using the first two basis functions ψ1(x) and ψ2(x) as coordinates; i.e., we consider the eigenmap x ↦ (ψ1(x), ψ2(x)). The eigenfunctions capture the structure of the data and vary smoothly with the response z (redshift). Similar data points are grouped together in the eigenmap; that is, samples with similar covariates are mapped to similar eigenfunction values (Coifman et al., 2005); see, for example, points A and B, or C and D in the figure. As a result, when f(z|x) is smooth as a function of x (Assumption 5 in Section 3 or Assumption 5' in Appendix A.5), the spectral series estimator yields good results. In addition, because distances in the eigenmap reflect similarity, the eigenvectors themselves are useful for detecting clusters and outliers in the data.
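As a minimal sketch of how such an eigenmap can be computed (not the authors' code; the toy data, bandwidth, and function name are illustrative), one can use the leading eigenvectors of a Gaussian kernel matrix as coordinates:

```python
import numpy as np

def spectral_embedding(X, eps, n_coords=2):
    """Embed the rows of X using the leading eigenvectors of a
    Gaussian kernel matrix as coordinates (psi_1, psi_2, ...)."""
    # Pairwise squared distances and Gaussian kernel weights.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4 * eps))
    # Symmetric eigendecomposition; eigh returns eigenvalues ascending.
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_coords]]

# Toy data: two noisy clusters in 5 dimensions; nearby points are
# mapped to similar eigenvector coordinates.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 5)), rng.normal(1, 0.1, (50, 5))])
coords = spectral_embedding(X, eps=0.5)
print(coords.shape)  # (100, 2)
```

In the figure, the coordinates are the analogous eigenfunctions ψ1 and ψ2 evaluated on the galaxy sample.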
∗ Department of Statistics, Federal University of Sao Carlos, Brazil.
† Department of Statistics, Carnegie Mellon University, USA.
Figure 1: Top: Embedding of the luminous red galaxies from SDSS when using the first two eigenvectors of the Gaussian kernel operator as coordinates (ψ1 on the horizontal axis, ψ2 on the vertical axis), with points colored by smoothed redshift (roughly 0.26 to 0.38). Bottom: Covariates of 4 selected galaxies, A-D: fiber, model, Petrosian, and PSF magnitudes in the colors g−r, i−z, and r−i. The eigenfunctions capture the connectivity (similarity) structure of the data and vary smoothly with the response (redshift).
A.2 Details on the Galaxy Data
Due to the expansion of the Universe, the wavelength of a photon increases as it travels to us. The redshift of a galaxy is defined as

z = λobs/λemit − 1,

where λobs and λemit, respectively, are the wavelengths of a photon when observed by us and when emitted by the galaxy.
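As a quick numerical illustration of this definition (the wavelengths below are illustrative, with a rest wavelength roughly that of the H-alpha line):

```python
def redshift(lam_obs, lam_emit):
    """Redshift z = lambda_obs / lambda_emit - 1 (any common unit)."""
    return lam_obs / lam_emit - 1.0

# Illustrative example: a line emitted at 6563 Angstrom and observed
# at 8203.75 Angstrom corresponds to a redshift of 0.25.
print(redshift(8203.75, 6563.0))  # 0.25
```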
In spectroscopy, the flux of a galaxy, i.e., the number of photons emitted per unit area per unit time, is measured as a function of wavelength. The light is collected in bins of width ∼ 1 Å, where 1 Å = 10^{−10} m. Because transitions of electrons within atoms lead to spikes and troughs of width ∼ 1 Å in spectra, and because these transitions occur at precisely known wavelengths, one can determine the redshift of a galaxy with great precision via spectroscopy.
On the other hand, in photometry – a low-resolution but less costly alternative to spectroscopy – the photons are collected into a few wavelength bins (also named bands). When performing photometry, astronomers generally convert photon flux into a magnitude via a logarithmic transformation. Whereas fluxes depend on the details of the observing instrument, (well-calibrated) magnitudes from different instruments can be directly compared. Photometry for the Sloan Digital Sky Survey (SDSS) is carried out in five wavelength bands, denoted u, g, r, i, and z, that span the visible part of the electromagnetic spectrum. Each band is ∼ 1000 Å wide. The differences between contiguous magnitudes (or colors; e.g., g − r) are useful predictors for the redshift of the galaxy.
Our objective is to estimate the conditional density f(z|x), where x’s are the observed colors of a
galaxy. We train our model using redshifts obtained via spectroscopy and test our method on the
following sets of galaxy data:
Luminous Red Galaxies from SDSS. This data set consists of 3,000 luminous red galaxies (LRGs) from the SDSS.¹ We use information about 3 colors (g − r, r − i, and i − z) in psf, fiber, petrosian, and model magnitudes;² that is, there are 3 · 4 = 12 covariates. For more details about the data, see Freeman et al. (2009).
Galaxies from Multiple Surveys. This data set, used by Sheldon et al. (2012), includes information on model and cmodel magnitudes of the u, g, r, i, and z bands for galaxies from multiple surveys, including SDSS, DEEP2³ (Lin et al., 2004), and others. As in Sheldon et al. (2012), we base our estimates on the colors and the raw r-band magnitude. Hence, there are 2 · 4 + 2 = 10 covariates.
We use random subsets of different sizes of these data for the various experiments.
COSMOS. This data set consists of magnitudes for 752 galaxies that are measured in 42 different photometric bands from 7 magnitude systems (T. Dahlen 2013, private communication).⁴ For each system, we compute the differences between adjacent bands. This yields a total of d = 37 colors.
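The construction of colors from magnitudes can be sketched as follows (the system names, band counts, and magnitude values below are hypothetical, for illustration only):

```python
import numpy as np

def colors_from_magnitudes(mags_by_system):
    """For each magnitude system, take differences between adjacent
    bands; concatenating over systems gives the color covariates."""
    return np.concatenate([np.diff(m) for m in mags_by_system.values()])

# Hypothetical galaxy: two magnitude systems with 5 and 3 bands.
galaxy = {"system_a": np.array([22.1, 20.9, 20.0, 19.6, 19.4]),
          "system_b": np.array([21.5, 20.7, 20.3])}
x = colors_from_magnitudes(galaxy)
print(len(x))  # 4 + 2 = 6 colors
```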
A.3 Details on Simulated Data
The following schemes were used to generate data plotted in Figure 3:
Data on Manifold. Data is generated according to Z|x ∼ N(x, 0.5), where X = (X1, . . . , X20),
¹ http://www.sdss.org/
² Multiple estimators of magnitude exist that differ in algorithmic details.
³ http://deep.ps.uci.edu/
⁴ http://cosmos.astro.caltech.edu/
(x1, . . . , xp) lie on the square [0, 1]^p, and xp+1 = . . . = x20 = 0.
Few Relevant Covariates. Let Z|x ∼ N((1/p) ∑_{i=1}^p xi, 0.5), where X = (X1, . . . , Xd) ∼ N(0, Id). Here only the first p covariates influence the response (i.e., the conditional density is sparse), but there is no sparse (low-dimensional) structure in X.
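A sketch of the second scheme, taking 0.5 as the variance and illustrative choices p = 3 and d = 10 (the function name and defaults are ours, not the authors'):

```python
import numpy as np

def few_relevant_covariates(n, d=10, p=3, seed=0):
    """Z | x ~ N(mean(x_1..x_p), 0.5): only the first p of the d
    standard-normal covariates influence the response."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    # Taking 0.5 as the variance of the conditional distribution.
    Z = rng.normal(loc=X[:, :p].mean(axis=1), scale=np.sqrt(0.5))
    return X, Z

X, Z = few_relevant_covariates(n=500)
print(X.shape, Z.shape)  # (500, 10) (500,)
```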
A.4 Variations on the Kernel Operator
Several variants of the operator from Equation 1 in the main manuscript exist in the spectral
methods literature; see e.g. von Luxburg (2007) for normalization schemes common in spectral
clustering. In this section, we describe a non-symmetric variation of the kernel operator, referred to as the diffusion operator (e.g., Lee and Wasserman, 2010). Although the diffusion operator leads to empirical performance similar to that of the kernel operator of Equation 1, the former has better-understood theoretical properties, especially in the limit of the bandwidth ε → 0 (Coifman and Lafon, 2006, Belkin and Niyogi, 2008). This in turn allows us to analyze the dependence of the spectral series estimator on its tuning parameters, which ultimately leads to tighter bounds on the loss compared to the kernel PCA operator.
In this section, we let the kernel K be a local, radially symmetric function Kε(x, y) = g(d(x, y)/√ε), where Kε(x, y) is positive and bounded for all x, y ∈ X. We use the notation Kε to emphasize the dependence of K on the bandwidth. The first step is to renormalize the kernel according to

aε(x, y) = Kε(x, y) / pε(x),

where pε(x) = ∫ Kε(x, y) dP(y). We refer to aε(x, y) as the diffusion kernel (Meila and Shi, 2001).
As in Lee and Wasserman (2010), we define a "diffusion operator" Aε as

Aε(h)(x) = ∫_X aε(x, y) h(y) dP(y).    (1)
The operator Aε has a discrete set of non-negative eigenvalues λε,0 = 1 ≥ λε,1 ≥ . . . ≥ 0 with associated eigenfunctions (ψε,i)i. The eigenfunctions are orthogonal with respect to the weighted L2 inner product ⟨f, g⟩ε = ∫_X f(x) g(x) dSε(x), where

Sε(A) = ∫_A pε(x) dP(x) / ∫ pε(x) dP(x)

can be interpreted as a
smoothed version of P . See Lee and Wasserman (2010) for additional interpretation and details on
how to estimate these quantities.
The construction of both the tensor basis {Ψi,j}i,j and the conditional density estimator proceeds just as in the case when using the operator from Equation 1. The only difference is that, as {Ψi,j}i,j are now orthonormal with respect to λ × Sε instead of λ × P, the coefficients of the projection are given by

βi,j = ∫∫ f(z|x) Ψi,j(z, x) dSε(x) dz = ∫∫ f(z|x) Ψi,j(z, x) sε(x) dP(x) dz
     = ∫∫ Ψi,j(z, x) sε(x) dP(x, z) = E[Ψi,j(Z, X) sε(X)].
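On a finite sample, the renormalization above amounts to dividing each row of the kernel matrix by its row sum, making the matrix row-stochastic; a minimal sketch, with an illustrative Gaussian kernel and bandwidth (not the authors' code):

```python
import numpy as np

def diffusion_kernel(X, eps):
    """Row-normalize a Gaussian kernel matrix:
    a_eps(x_i, x_j) = K_eps(x_i, x_j) / p_eps(x_i)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4 * eps))
    p = K.sum(axis=1)              # empirical p_eps(x_i), up to a 1/n factor
    return K / p[:, None]

rng = np.random.default_rng(1)
A = diffusion_kernel(rng.normal(size=(30, 2)), eps=0.5)
print(np.allclose(A.sum(axis=1), 1.0))  # True: each row is a probability vector
```

Being row-stochastic, the matrix has largest eigenvalue 1, mirroring λε,0 = 1 for the operator Aε.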
A.5 Bounds for Spectral Series Estimator with Kernel PCA Basis
Smoothness in x can be defined in many ways. The standard approach in the theoretical Support Vector Machine (SVM) literature (Steinwart and Christmann, 2008) is to consider the norm of a fixed Reproducing Kernel Hilbert Space (RKHS). Here we derive rates for a spectral series conditional density estimator with a fixed kernel, using the operator of Equation 1 in the main manuscript. As before, we assume:
Assumption 1. ∫ f²(z|x) dP(x) dz < ∞.

Assumption 2. Mφ := sup_z sup_i φi(z) < ∞.

Assumption 3. λ1 > λ2 > . . . > λJ > 0.
Assumption 4. (Smoothness in z direction) ∀x ∈ X, let

hx : [0, 1] → ℝ,   hx(z) = f(z|x).

Note that hx(z) is simply f(z|x) viewed as a function of z for a fixed x. We assume

hx ∈ Wφ(sx, cx),

where sx and cx are such that β := inf_x sx > 1/2 and ∫_X c²x dP(x) < ∞.
Let HK be the Reproducing Kernel Hilbert Space (RKHS) associated to the kernel K. Instead of Assumption 5, we here assume

Assumption 5'. (Smoothness in x direction) ∀z ∈ [0, 1], f(z|x) ∈ {g ∈ HK : ||g||²_{HK} ≤ c²z}, where f(z|x) is viewed as a function of x, and the cz values are such that cK := ∫_{[0,1]} c²z dz < ∞.

In other words, we enforce smoothness in the x direction by requiring f(z|x) to be in an RKHS for all z. Smaller values of cK indicate smoother functions. The reader is referred to, e.g., Minh et al. (2006) for an account of measuring smoothness through norms in RKHSs.
Let n be the sample size of the data used to estimate the coefficients βi,j, and let m be the sample size of the data used to estimate the basis functions. In the next section, we prove the following:

Theorem 1. Let f̂I,J(z|x) be the spectral series estimator with cutoffs I and J, based on the eigenfunctions of the operator in Equation (1). Under Assumptions 1-4 and 5', the loss satisfies

L(f̂I,J, f) = IJ × [OP(1/n) + OP(1/(λJ ΔJ² m))] + cK O(λJ) + O(1/I^{2β}),    (2)

where ΔJ = min_{1≤j≤J} |λj − λj+1|.
Some interpretation: The first term of the bound of Theorem 1 corresponds to the sampling error of the estimator. The second and third terms correspond to the approximation error. Note that in d + 1 dimensions, the variance of traditional orthogonal series methods is ∏_{i=1}^{d+1} Ii × OP(1/n), where Ii is the number of components used for the i:th variable (Efromovich, 1999). Interestingly, the term IJ × OP(1/n) for our estimator is the variance term of a traditional series estimator with one tensor product only; i.e., this is the variance for the traditional case with only one explanatory variable. The cost of estimating the basis adds the term IJ × OP((λJ ΔJ² m)^{−1}) to our rate.
To provide some insight regarding the effect of estimating the basis on the final estimator, we work out the details of two examples where the eigenvalues follow a polynomial decay λJ ≍ J^{−2α} for some α > 1/2. See Ji et al. (2012) for some empirical motivation for polynomial decay rates, and Steinwart et al. (2009) for theory and examples. The constant α is typically related to the dimensionality of the data.
Example 1 (Supervised Learning). For a power-law eigenvalue decay, the eigengap λJ − λJ+1 = O(J^{−2α−1}). If we use the same data for estimating the basis and the coefficients of the expansion, i.e., if n = m, then assuming a fixed kernel K, the optimal cutoffs for the bound from Theorem 1 are I ≍ n^{α/(8αβ+α+3β)} and J ≍ n^{β/(8αβ+α+3β)}, yielding the rate OP(n^{−2αβ/(8αβ+α+3β)}).
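These cutoffs can be checked symbolically: writing I ≍ n^i and J ≍ n^j, the n-exponents of the variance term I J^{6α+3}/n (from IJ/(λJ ΔJ² n) with λJ ΔJ² ≍ J^{−6α−2}) and of the two bias terms λJ ≍ J^{−2α} and I^{−2β} all equal the claimed −2αβ/(8αβ + α + 3β) at the stated choices. A sketch of this check using sympy:

```python
import sympy as sp

a, b = sp.symbols("alpha beta", positive=True)  # alpha, beta
D = 8 * a * b + a + 3 * b                       # common denominator
i, j = a / D, b / D                             # I ~ n^i, J ~ n^j

# Exponent of n in each term of the Theorem 1 bound when n = m:
var_basis = i + (6 * a + 3) * j - 1   # I*J/(lambda_J Delta_J^2 n) ~ I J^{6a+3}/n
bias_x = -2 * a * j                   # lambda_J ~ J^{-2 alpha}
bias_z = -2 * b * i                   # I^{-2 beta}
rate = -2 * a * b / D                 # claimed overall rate exponent

print([sp.simplify(t - rate) for t in (var_basis, bias_x, bias_z)])  # [0, 0, 0]
```

The remaining variance term IJ/n has exponent (α + β − D)/D, which is strictly smaller, so it does not drive the rate.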
Example 2 (Semi-Supervised Learning). Suppose that we have additional unlabeled data which can be used to learn the structure of the data distribution. In the limit of infinite unlabeled data, m → ∞, the optimal cutoffs are I ≍ n^{α/(2αβ+α+β)} and J ≍ n^{β/(2αβ+α+β)}, yielding the rate OP(n^{−2αβ/(2αβ+α+β)}).
By comparing the rates from Examples 1 and 2, we see that estimating the basis (ψj)j decreases the rate of convergence. On the other hand, the possibility of using unlabeled data alleviates the problem. As an illustration, let the RKHS H in Assumption 5' be the isotropic Sobolev space with smoothness s > d and β = s; i.e., assume that f(z|x) belongs to a Sobolev space with the same smoothness in both directions. It is known that under certain conditions on the domain X, the eigenvalues satisfy λJ ≍ J^{−2s/d} (see, e.g., Steinwart et al., 2009, Koltchinskii and Yuan, 2010). In this setting, the rate when m = n is OP(n^{−2s/(8s+1+3d)}). Notice that similar rates (also not minimax optimal) are obtained when learning regression functions via RKHSs; see, e.g., Ye and Zhou (2008) and Steinwart et al. (2009). In the limit of infinite unlabeled data, however, we achieve the rate OP(n^{−2s/(2s+1+d)}), which is the standard minimax rate for estimating functions in d + 1 dimensions (recall that the conditional density is defined in d + 1 dimensions) (Stone, 1982, Hoffmann and Lepski, 2002).
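The rates quoted here follow from the exponents of Examples 1 and 2 by substituting α = s/d and β = s:

```latex
\frac{2\alpha\beta}{8\alpha\beta+\alpha+3\beta}
  = \frac{2(s/d)\,s}{8(s/d)\,s + s/d + 3s}
  = \frac{2s^2}{8s^2 + s + 3sd}
  = \frac{2s}{8s + 1 + 3d},
\qquad
\frac{2\alpha\beta}{2\alpha\beta+\alpha+\beta}
  = \frac{2s^2}{2s^2 + s + sd}
  = \frac{2s}{2s + 1 + d}.
```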
A.5.1 Proofs
To simplify the proofs, we assume the functions ψ1, . . . , ψJ are estimated using an unlabeled sample
X1, . . . , Xm, drawn independently from the sample used to estimate the coefficients βi,j . Without
loss of generality, this can be achieved by splitting the labeled sample in two. This split is only
for theoretical purposes; in practice using all data to estimate the basis leads to better results.
The technique also allows us to derive bounds for the semi-supervised learning setting described in
the paper, and better understand the additional cost of estimating the basis. Define the following
quantities:
fI,J(z|x) = ∑_{i=1}^I ∑_{j=1}^J βi,j φi(z) ψj(x),    βi,j = ∫∫ φi(z) ψj(x) f(z, x) dx dz,

f̂I,J(z|x) = ∑_{i=1}^I ∑_{j=1}^J β̂i,j φi(z) ψ̂j(x),    β̂i,j = (1/n) ∑_{k=1}^n φi(zk) ψ̂j(xk).
Note that

∫∫ (f̂I,J(z|x) − f(z|x))² dP(x) dz = ∫∫ (f̂I,J(z|x) − fI,J(z|x) + fI,J(z|x) − f(z|x))² dP(x) dz
  ≤ 2 (VAR(f̂I,J, fI,J) + B(fI,J, f)),    (3)

where B(fI,J, f) := ∫∫ (fI,J(z|x) − f(z|x))² dP(x) dz can be interpreted as a bias term (or approximation error), and VAR(f̂I,J, fI,J) := ∫∫ (f̂I,J(z|x) − fI,J(z|x))² dP(x) dz can be interpreted as a variance term. First we bound the variance.
Lemma 1. ∀1 ≤ j ≤ J,

∫ (ψ̂j(x) − ψj(x))² dP(x) = OP(1/(λj δj² m)),

where δj = λj − λj+1.
For a proof of Lemma 1 see for example Sinha and Belkin (2009).
Lemma 2. ∀1 ≤ j ≤ J, there exists C < ∞ that does not depend on m such that

E[(ψ̂j(X) − ψj(X))²] < C,

where X ∼ P(x) is independent of the sample used to construct ψ̂j.
Proof. Let δ ∈ (0, 1). From Sinha and Belkin (2009), it follows that

P(∫ (ψ̂j(x) − ψj(x))² dP(x) > 16 log(2/δ)/(δj² m)) < δ,

and therefore ∀ε > 0,

P(∫ (ψ̂j(x) − ψj(x))² dP(x) > ε) < 2 e^{−δj² m ε/16}.

Hence

E[(ψ̂j(X) − ψj(X))²] = E[∫ (ψ̂j(x) − ψj(x))² dP(x)] = ∫_0^∞ P(∫ (ψ̂j(x) − ψj(x))² dP(x) > ε) dε ≤ ∫ 2 e^{−δj² m ε/16} dε ≤ ∫ 2 e^{−δj² ε/16} dε < ∞.
Lemma 3. ∀1 ≤ i ≤ I and ∀1 ≤ j ≤ J, there exists C < ∞ that does not depend on m such that

E[V[φi(Z)(ψ̂j(X) − ψj(X)) | X1, . . . , Xm]] < C.

Proof. Using that φ is bounded (Assumption 2), it follows that

E[V[φi(Z)(ψ̂j(X) − ψj(X)) | X1, . . . , Xm]] ≤ V[φi(Z)(ψ̂j(X) − ψj(X))] ≤ E[φi²(Z)(ψ̂j(X) − ψj(X))²] ≤ K E[(ψ̂j(X) − ψj(X))²]

for some K < ∞. The result follows from Lemma 2.
Lemma 4. ∀1 ≤ i ≤ I and ∀1 ≤ j ≤ J,

[(1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − ∫∫ φi(z)(ψ̂j(x) − ψj(x)) dP(z, x)]² = OP(1/n).
Proof. Let A = ∫∫ φi(z)(ψ̂j(x) − ψj(x)) dP(z, x). By Chebyshev's inequality, it holds that ∀M > 0,

P(|(1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − A|² > M | X1, . . . , Xm) ≤ (1/(nM)) V[φi(Z)(ψ̂j(X) − ψj(X)) | X1, . . . , Xm].

The conclusion follows from taking an expectation with respect to the unlabeled samples on both sides and using Lemma 3.

Note that the ψ̂'s are random functions, and therefore the proof of Lemma 4 relies on the fact that these functions are estimated using a different sample than X1, . . . , Xn.
Lemma 5. ∀1 ≤ i ≤ I and ∀1 ≤ j ≤ J,

(β̂i,j − βi,j)² = OP(1/n) + OP(1/(λj δj² m)).
Proof. It holds that

(1/2)(β̂i,j − βi,j)² ≤ ((1/n) ∑_{k=1}^n φi(Zk) ψj(Xk) − βi,j)² + ((1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)))².

The first term is OP(1/n). Let A be as in the proof of Lemma 4. By using the Cauchy-Schwarz inequality and Lemma 4, the second term divided by two is bounded by

(1/2)((1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − A + A)² ≤ ((1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − A)² + A²
  ≤ OP(1/n) + (∫∫ φi²(z) dP(z, x)) (∫∫ (ψ̂j(x) − ψj(x))² dP(z, x)).

The result follows from Lemma 1 and the orthogonality of φi.
Lemma 6. [Sinha and Belkin 2009, Corollary 1] Under the stated assumptions,

∫ ψ̂j²(x) dP(x) = OP(1/(λj ΔJ² m)) + 1

and

∫ ψ̂i(x) ψ̂j(x) dP(x) = OP((1/√λi + 1/√λj) · 1/(ΔJ √m)),

where ΔJ = min_{1≤j≤J} δj.
Lemma 7. Let h(z|x) = ∑_{i=1}^I ∑_{j=1}^J βi,j φi(z) ψ̂j(x). Then

∫∫ |f̂I,J(z|x) − h(z|x)|² dP(x) dz = IJ (OP(1/n) + OP(1/(λJ ΔJ² m))).
Proof.

∫∫ |f̂I,J(z|x) − h(z|x)|² dP(x) dz
  = ∑_{i=1}^I ∑_{j=1}^J ∑_{l=1}^J (β̂i,j − βi,j)(β̂i,l − βi,l) ∫ ψ̂j(x) ψ̂l(x) dP(x)
  ≤ ∑_{i=1}^I ∑_{j=1}^J (β̂i,j − βi,j)² ∫ ψ̂j²(x) dP(x) + ∑_{i=1}^I ∑_{j=1}^J ∑_{l=1, l≠j}^J (β̂i,j − βi,j)(β̂i,l − βi,l) ∫ ψ̂j(x) ψ̂l(x) dP(x)
  ≤ ∑_{i=1}^I ∑_{j=1}^J (β̂i,j − βi,j)² ∫ ψ̂j²(x) dP(x) + ∑_{i=1}^I ∑_{j=1}^J (β̂i,j − βi,j)² √(∑_{j=1}^J ∑_{l=1, l≠j}^J (∫ ψ̂j(x) ψ̂l(x) dP(x))²),

where the last inequality follows from repeatedly using Cauchy-Schwarz. The result follows from Lemmas 5 and 6.
Lemma 8. Let h(z|x) be as in Lemma 7. Then

∫∫ |h(z|x) − fI,J(z|x)|² dP(x) dz = J · OP(1/(λJ ΔJ² m)).
Proof. Using the Cauchy-Schwarz inequality,

∫∫ |h(z|x) − fI,J(z|x)|² dP(x) dz = ∫∫ |∑_{i=1}^I ∑_{j=1}^J βi,j φi(z)(ψ̂j(x) − ψj(x))|² dP(x) dz
  ≤ (∑_{j=1}^J ∫ [∑_{i=1}^I βi,j φi(z)]² dz) (∑_{j=1}^J ∫ [ψ̂j(x) − ψj(x)]² dP(x))
  = (∑_{j=1}^J ∑_{i=1}^I β²i,j) (∑_{j=1}^J ∫ [ψ̂j(x) − ψj(x)]² dP(x)).

The conclusion follows from Lemma 1 and by noticing that ∑_{j=1}^J ∑_{i=1}^I β²i,j ≤ ||f(z|x)||² < ∞.
Using the results above, we can now bound the variance term:

Theorem 2. Under the stated assumptions,

VAR(f̂I,J, fI,J) = IJ (OP(1/n) + OP(1/(λJ ΔJ² m))).

Proof. Let h be defined as in Lemma 7. We have

(1/2) VAR(f̂I,J, fI,J) = (1/2) ∫∫ |f̂I,J(z|x) − h(z|x) + h(z|x) − fI,J(z|x)|² dP(x) dz
  ≤ ∫∫ |f̂I,J(z|x) − h(z|x)|² dP(x) dz + ∫∫ |h(z|x) − fI,J(z|x)|² dP(x) dz.

The conclusion follows from Lemmas 7 and 8.
We next bound the bias term.

Lemma 9. For each z ∈ [0, 1], let gz(x) = f(z|x), viewed as a function of x, and expand gz in the basis ψ: gz(x) = ∑_{j≥1} α_j^z ψj(x), where α_j^z = ∫ gz(x) ψj(x) dP(x). We have

α_j^z = ∑_{i≥1} βi,j φi(z)   and   ∫ (α_j^z)² dz = ∑_{i≥1} β²i,j.

Proof. The result follows from projecting α_j^z onto the basis φ.
Similarly, we have the following:

Lemma 10. For each x ∈ X, expand hx(z) in the basis φ: hx(z) = ∑_{i≥1} α_i^x φi(z), where α_i^x = ∫ hx(z) φi(z) dz. We have

α_i^x = ∑_{j≥1} βi,j ψj(x)   and   ∫ (α_i^x)² dP(x) = ∑_{j≥1} β²i,j.
Lemma 11. Using the same notation as Lemmas 9 and 10, we have

βi,j = ∫ α_i^x ψj(x) dP(x) = ∫ α_j^z φi(z) dz.

Proof. The result follows from plugging the definitions of α_i^x and α_j^z into the expressions above and recalling the definition of βi,j.
Lemma 12. ∑_{i≥I} ∫ (α_i^x)² dP(x) = O(1/I^{2β}).

Proof. By Lemma 10, hx(z) = ∑_{i≥1} α_i^x φi(z). As, by Assumption 4, hx ∈ Wφ(sx, cx),

∑_{i≥I} I^{2sx} (α_i^x)² ≤ ∑_{i≥I} i^{2sx} (α_i^x)² ≤ c²x.

Hence

∑_{i≥I} ∫ (α_i^x)² dP(x) ≤ ∫ (c²x / I^{2sx}) dP(x) ≤ (1/I^{2β}) ∫_X c²x dP(x) < ∞.
Lemma 13. ∑_{j≥J} ∫ (α_j^z)² dz = cK O(λJ).

Proof. Note that ||gz(·)||²_{HK} = ∑_{j≥1} (α_j^z)²/λj (Minh, 2010). Using Assumption 5' and the fact that the eigenvalues are decreasing, it follows that

∑_{j≥J} (α_j^z)² = ∑_{j≥J} (α_j^z)² (λj/λj) ≤ λJ ∑_{j≥J} (α_j^z)²/λj ≤ λJ ||gz(·)||²_{HK} ≤ λJ c²z,

and therefore

∑_{j≥J} ∫ (α_j^z)² dz ≤ λJ ∫ c²z dz = cK O(λJ).
Theorem 3. Under the stated assumptions, the bias is bounded by

B(fI,J, f) = cK O(λJ) + O(1/I^{2β}).

Proof. By using orthogonality, we have that

B(fI,J, f) := ∫∫ (f(z|x) − fI,J(z|x))² dP(x) dz ≤ ∑_{j>J} ∑_{i≥1} β²i,j + ∑_{i>I} ∑_{j≥1} β²i,j ≤ ∑_{j≥J} ∫ (α_j^z)² dz + ∑_{i≥I} ∫ (α_i^x)² dP(x),

where the last step follows from Lemmas 9 and 10. The theorem follows from Lemmas 12 and 13.

By putting together Theorems 2 and 3 according to the bias-variance decomposition of Equation 3, we arrive at Theorem 1 in the appendix.
A.6 Proofs for the Spectral Series Estimator with Diffusion Basis
The proof of Theorem 1 in the main manuscript follows the same principles as the derivations in the last section. The main differences are:

1. The orthogonality of the diffusion basis is defined with respect to the stationary measure S instead of P;

2. The bounds on the eigenfunctions of Lemma 1 have to be adapted for the diffusion basis;

3. The bias term should take into account the new smoothness assumption in the x direction.

Below we present the results that handle these differences. We assume n = m unless otherwise stated, and we make the following additional regularity assumptions:

(RC1) P has compact support X and bounded density 0 < a ≤ p(x) ≤ b < ∞, ∀x ∈ X.

(RC2) The weights are positive and bounded; that is, ∀x, y ∈ X, 0 < m ≤ kε(x, y) ≤ M, where m and M are constants that do not depend on ε.

(RC3) ∀0 ≤ j ≤ J and X ∼ P, there exists some constant C < ∞ (not depending on n) such that

E[|ϕε,j(X) − ϕ̂ε,j(X)|²] < C,

where ϕε,j(x) = ψε,j(x) sε(x) and ϕ̂ε,j(x) = ψ̂ε,j(x) sε(x).
In the proofs that follow, we handle the first difference by the following lemma:

Lemma 14. ∀x ∈ X,

a/b ≤ sε(x) ≤ b/a.

Proof. ∀x ∈ X,

inf_{x∈X} pε(x) / sup_{x∈X} pε(x) ≤ sε(x) ≤ sup_{x∈X} pε(x) / inf_{x∈X} pε(x),

where a ∫_X kε(x, y) dy ≤ pε(x) ≤ b ∫_X kε(x, y) dy.

Now, let

Gε = (Aε − I)/ε,    (4)

where I is the identity operator. The operator Gε has the same eigenvectors ψε,j as the diffusion operator Aε. Its eigenvalues are given by −ν²ε,j = (λε,j − 1)/ε, where the λε,j are the eigenvalues of Aε. Define the functional

Jε(f) = −⟨Gε f, f⟩ε,    (5)

which maps a function f ∈ L2(X, P) to a non-negative real number. For small ε, this functional measures the variability of the function f with respect to the distribution P.
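A small numerical sketch of this interpretation (empirical operator, illustrative sample and bandwidth, not the authors' code): on a uniform sample from [0, 1], the empirical version of −⟨Gεf, f⟩ is far larger for a rapidly oscillating f than for a smooth one.

```python
import numpy as np

def J_eps(x, f_vals, eps):
    """Empirical sketch of J_eps(f) = -<G_eps f, f>_eps, with
    G_eps = (A_eps - I)/eps and A_eps the row-normalized kernel."""
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-sq / (4 * eps))
    A = K / K.sum(axis=1, keepdims=True)
    G = (A - np.eye(len(x))) / eps
    # The inner product is approximated by a simple sample average.
    return -np.mean((G @ f_vals) * f_vals)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=300))
smooth, wiggly = x, np.sin(40 * x)
print(J_eps(x, smooth, 1e-3) < J_eps(x, wiggly, 1e-3))  # True
```

Consistently with Lemma 16 below, the value for the oscillating function is on the order of its much larger Dirichlet energy ∫ ‖∇f‖².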
The following result bounds the approximation error for an orthogonal series expansion of a given function f.

Proposition 1. For f ∈ L2(X, P),

∫_X |f(x) − fε,J(x)|² dP(x) ≤ O(Jε(f)/ν²ε,J+1),    (6)

where −ν²ε,J+1 is the (J + 1)th eigenvalue of Gε, and fε,J is the projection of f onto the first J eigenfunctions.

Proof. Note that Jε(f) = ∑_j ν²ε,j |βε,j|². Hence,

Jε(f)/ν²ε,J+1 = ∑_j (ν²ε,j/ν²ε,J+1) |βε,j|² ≥ ∑_{j>J} (ν²ε,j/ν²ε,J+1) |βε,j|² ≥ ∑_{j>J} |βε,j|² = ∫_X |f(x) − fε,J(x)|² dSε(x).

The result follows from Lemma 14.

The total bias bound, O(Jε(f)/ν²ε,J+1) + O(I^{−2β}), is derived in the same fashion as the bound from Theorem 3: we combine the bound on the bias in the x direction (derived from Proposition 1) with the bias in the z direction (derived in Lemma 12). The following two additional results will be used to derive the bounds in the case ε → 0.
Denote the quantities derived from the bias-corrected kernel k*ε by A*ε, G*ε, J*ε, etc. In the limit ε → 0, we have the following result:

Lemma 15. (Coifman and Lafon, 2006; Proposition 3) For f ∈ C³(X) and x ∈ X \ ∂X,

− lim_{ε→0} G*ε = Δ.

If X is a compact C∞ submanifold of R^d, then Δ is the (positive semi-definite) Laplace-Beltrami operator of X, defined by

Δf(x) = −∑_{j=1}^r (∂²f/∂sj²)(x),

where (s1, . . . , sr) are the normal coordinates of the tangent plane at x.

Lemma 16. For functions f ∈ C³(X) whose gradients vanish at the boundary,

lim_{ε→0} J*ε(f) = ∫_X ‖∇f(x)‖² dS(x).
Proof. By Green's first identity,

∫_X f ∇²f dS(x) + ∫_X ∇f · ∇f dS(x) = ∮_{∂X} f (∇f · n) dS(x) = 0,

where n is the normal direction to the boundary ∂X, and the last surface integral vanishes due to the Neumann boundary condition. It follows from Lemma 15 that

lim_{ε→0} J*ε(f) = − lim_{ε→0} ∫_X f(x) G*ε f(x) dSε(x) = ∫_X f(x) Δf(x) dS(x) = ∫_X ‖∇f(x)‖² dS(x).
Let Âε denote the sample version of the integral operator Aε. To bound the difference ψε,j − ψ̂ε,j, we follow the strategy from Rosasco et al. (2010) and introduce two new integral operators that are related to Aε and Âε, but that both act on an auxiliary⁵ RKHS H of smooth functions. Define AH, ÂH : H → H, where

AH f(x) = ∫ kε(x, y)⟨f, K(·, y)⟩H dP(y) / ∫ kε(x, y) dP(y) = ∫ aε(x, y)⟨f, K(·, y)⟩H dP(y),

ÂH f(x) = ∑_{i=1}^n kε(x, Xi)⟨f, K(·, Xi)⟩H / ∑_{i=1}^n kε(x, Xi) = ∫ âε(x, y)⟨f, K(·, y)⟩H dPn(y),

and K is the reproducing kernel of H. Define the operator norm ‖A‖H = sup_{f∈H} ‖Af‖H/‖f‖H, where ‖f‖²H = ⟨f, f⟩H. Now suppose the weight function kε is sufficiently smooth with respect to H (Assumption 1 in Rosasco et al. 2010); this condition is, for example, satisfied by a Gaussian kernel on a compact support X. By Propositions 13.3 and 14.3 in Rosasco et al. (2010), we can then relate the functions ψε,j and ψ̂ε,j, respectively, to the eigenfunctions uε,j and ûε,j of AH and ÂH. We have that

‖ψε,j − ψ̂ε,j‖_{L2(X,P)} = C1 ‖uε,j − ûε,j‖_{L2(X,P)} ≤ C2 ‖uε,j − ûε,j‖H    (7)

⁵ This auxiliary space only enters the intermediate derivations and plays no role in the error analysis of the algorithm itself.
for some constants C1 and C2. According to Theorem 6 in Rosasco et al. (2008) for eigenprojections of positive compact operators, it holds that

‖uε,j − ûε,j‖H ≤ ‖AH − ÂH‖H / δε,j,    (8)

where δε,j is proportional to the eigengap λε,j − λε,j+1. As a result, we can bound the difference ‖ψε,j − ψ̂ε,j‖_{L2(X,P)} by controlling the deviation ‖AH − ÂH‖H.
We choose the auxiliary RKHS H to be a Sobolev space with a sufficiently high degree of smoothness so that certain assumptions ((RC4)-(RC5) below) are fulfilled. Let Hs denote the Sobolev space of order s with vanishing gradients at the boundary; that is, let

Hs = {f ∈ L2(X) | Dαf ∈ L2(X) ∀|α| ≤ s, Dαf|∂X = 0 ∀|α| = 1},

where Dαf is the weak partial derivative of f with respect to the multi-index α, and L2(X) is the space of square integrable functions with respect to the Lebesgue measure. Let C³b(X) be the set of uniformly bounded, three times differentiable functions with uniformly bounded derivatives whose gradients vanish at the boundary. Now suppose that H ⊂ C³b(X) and that

(RC4) ∀f ∈ H, |α| = s, Dα(AH f − ÂH f) = AH Dαf − ÂH Dαf;

(RC5) ∀f ∈ H, |α| = s, Dαf ∈ C³b(X).
Lemma 17. Let εn → 0 and n εn^{d/2} / log(1/εn) → ∞. Then, under the stated regularity conditions, ‖AH − ÂH‖H = OP(γn), where

γn = √( log(1/εn) / (n εn^{d/2}) ).
Proof. Uniformly for all f ∈ C³b(X) and all x in the support of P,

|Aεf(x) − Âεf(x)| ≤ |Aεf(x) − Ãεf(x)| + |Ãεf(x) − Âεf(x)|,

where Ãεf(x) = ∫ ãε(x, y) f(y) dP(y) and ãε(x, y) = kε(x, y)/p̂ε(x). From Gine and Guillou (2002),

sup_x |pε(x) − p̂ε(x)| / |pε(x) p̂ε(x)| = OP(γn).

Hence,

|Aεf(x) − Ãεf(x)| ≤ (|pε(x) − p̂ε(x)| / |pε(x) p̂ε(x)|) ∫ |f(y)| kε(x, y) dP(y) = OP(γn) ∫ |f(y)| kε(x, y) dP(y) = OP(γn).

Next, we bound Ãεf(x) − Âεf(x). We have

Ãεf(x) − Âεf(x) = ∫ f(y) ãε(x, y) (dPn(y) − dP(y)) = (1/(p(x) + oP(1))) ∫ f(y) kε(x, y) (dPn(y) − dP(y)).

Now, expand f(y) = f(x) + rn(y), where rn(y) = (y − x)ᵀ∇f(uy) and uy is between y and x. So,

∫ f(y) kε(x, y)(dPn(y) − dP(y)) = f(x) ∫ kε(x, y)(dPn(y) − dP(y)) + ∫ rn(y) kε(x, y)(dPn(y) − dP(y)).

By an application of Talagrand's inequality to each term, as in Theorem 5.1 of Gine and Koltchinskii (2006), we have

∫ f(y) kε(x, y)(dPn(y) − dP(y)) = OP(γn).

Thus, sup_{f∈C³b(X)} ‖Aεf − Âεf‖∞ = OP(γn).
The Sobolev space H is a Hilbert space with respect to the scalar product

⟨f, g⟩H = ⟨f, g⟩_{L2(X)} + ∑_{|α|=s} ⟨Dαf, Dαg⟩_{L2(X)}.

Under regularity conditions (RC4)-(RC5),

sup_{f∈H: ‖f‖H=1} ‖Aεf − Âεf‖²H ≤ sup_{f∈H} ∑_{|α|≤s} ‖Dα(Aεf − Âεf)‖²_{L2(X)} = ∑_{|α|≤s} sup_{f∈H} ‖AεDαf − ÂεDαf‖²_{L2(X)}
  ≤ ∑_{|α|≤s} sup_{f∈C³b(X)} ‖Aεf − Âεf‖²_{L2(X)} ≤ C sup_{f∈C³b(X)} ‖Aεf − Âεf‖²∞

for some constant C. Hence,

sup_{f∈H} ‖Aεf − Âεf‖H/‖f‖H = sup_{f∈H, ‖f‖H=1} ‖Aεf − Âεf‖H ≤ C′ sup_{f∈C³b(X)} ‖Aεf − Âεf‖∞ = OP(γn). □
For εn → 0 and n εn^{d/2}/log(1/εn) → ∞, it then holds that:

Proposition 2. ∀0 ≤ j ≤ J,

‖ψε,j − ψ̂ε,j‖_{L2(X,P)} = OP(γn/δε,j),

where δε,j = λε,j − λε,j+1.

Proof. From Lemma 14 and Equation 8, we have that

‖ψε,j − ψ̂ε,j‖_{L2(X,P)} ≤ √(b/a) ‖ψε,j − ψ̂ε,j‖ε ≤ C ‖AH − ÂH‖H / (λε,j − λε,j+1)

for some constant C that does not depend on n. The result follows from Lemma 17.
By putting together the bounds on bias and variance, we arrive at our main result.

Theorem 4. Under the conditions of Theorem 2 of the paper, if εn → 0 and n εn^{d/2}/log(1/εn) → ∞, a bound on the loss of the conditional density estimator with diffusion basis is given by

L(f̂, f) = O(Jε(f)/ν²ε,J+1) + O(1/I^{2β}) + IJ · OP(1/n) + IJ · OP(γn²/(εn ΔJ²)),

where J(f) = ∫_X ‖∇f(x)‖² dS(x), ΔJ = min_{0≤j≤J}(ν²j+1 − ν²j), and ν²J+1 is the (J + 1)th eigenvalue of Δ.
A Taylor expansion yields the following:

Corollary 1. Under the conditions of Theorem 2 of the paper, if εn → 0 and n εn^{d/2}/log(1/εn) → ∞, a bound on the loss of the conditional density estimator with diffusion basis is given by

L(f̂, f) = (J(f) O(1) + O(εn)) / ν²J+1 + O(1/I^{2β}) + IJ · OP(1/n) + IJ · OP(γn²/(εn ΔJ²)).
Corollary 2. By taking ε = n^{−2/(d+4)} and ignoring lower-order terms,

L(f̂, f) = O(J(f)/ν²J+1) + O(1/I^{2β}) + (IJ/ΔJ²) OP((log n / n)^{2/(d+4)}).    (9)

If the support of the data is on a manifold with intrinsic dimension p, the eigenvalues of Δ satisfy ν²j ∼ j^{2/p} (Zhou and Srebro, 2011). Theorem 1 in the main manuscript follows.
References
Belkin, M. and P. Niyogi (2008). Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences 74(8), 1289–1308.
Coifman, R. and S. Lafon (2006). Diffusion maps. Applied and Computational Harmonic Analysis 21, 5–30.
Coifman, R. R., S. Lafon, A. B. Lee, et al. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America 102(21), 7426–7431.
Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory and Applications. Springer Series in Statistics. Springer.
Freeman, P. E., J. A. Newman, A. B. Lee, J. W. Richards, and C. M. Schafer (2009). Photometric redshift estimation using Spectral Connectivity Analysis. Monthly Notices of the Royal Astronomical Society.
Gine, E. and A. Guillou (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l'Institut Henri Poincare 38, 907–921.
Hoffmann, M. and O. Lepski (2002). Random rates in anisotropic regression. The Annals of Statistics 30, 325–358.
Ji, M., T. Yang, B. Lin, R. Jin, and J. Han (2012). A simple algorithm for semi-supervised learning with improved generalization error bound. In Proceedings of the 29th International Conference on Machine Learning.
Koltchinskii, V. and M. Yuan (2010). Sparsity in multiple kernel learning. The Annals of Statistics 38(6), 3660–3695.
Lee, A. B. and L. Wasserman (2010). Spectral Connectivity Analysis. Journal of the American Statistical Association 105(491), 1241–1255.
Lin, L., D. C. Koo, C. N. Willmer, et al. (2004). The DEEP2 galaxy redshift survey: Evolution of close galaxy pairs and major-merger rates up to z ∼ 1.2. The Astrophysical Journal Letters 617(1), 9–12.
Meila, M. and J. Shi (2001). A random walks view of spectral segmentation. In Proc. Eighth International Conference on Artificial Intelligence and Statistics.
Minh, H. Q. (2010). Some properties of Gaussian Reproducing Kernel Hilbert Spaces and their implications for function approximation and learning theory. Constructive Approximation 32(2), 307–338.
Minh, H. Q., P. Niyogi, and Y. Yao (2006). Mercer's theorem, feature maps, and smoothing. In Learning Theory, 19th Annual Conference on Learning Theory.
Rosasco, L., M. Belkin, and E. D. Vito (2008). A note on perturbation results for learning empirical operators. CSAIL Technical Report TR-2008-052, CBCL-274, Massachusetts Institute of Technology.
Rosasco, L., M. Belkin, and E. D. Vito (2010). On learning with integral operators. Journal of Machine Learning Research 11, 905–934.
Sheldon, E., C. Cunha, R. Mandelbaum, J. Brinkmann, and B. Weaver (2012). Photometric redshift probability distributions for galaxies in the SDSS DR8. The Astrophysical Journal Supplement Series 201(2).
Sinha, K. and M. Belkin (2009). Semi-supervised learning using sparse eigenfunction bases. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, pp. 1687–1695.
Steinwart, I. and A. Christmann (2008). Support Vector Machines. Springer.
Steinwart, I., D. R. Hush, and C. Scovel (2009). Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pp. 79–93.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics 10, 1040–1053.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416.
Ye, G.-B. and D.-X. Zhou (2008). Learning and approximation by Gaussians on Riemannian manifolds. Advances in Computational Mathematics 29(3), 291–310.
Zhou, X. and N. Srebro (2011). Error analysis of Laplacian eigenmaps for semi-supervised learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Volume 15, pp. 892–900.