Appendix of “Nonparametric Conditional Density Estimation in a
High-Dimensional Regression Setting”
Rafael Izbicki∗ and Ann B. Lee†
A.1 Data Visualization via Spectral Series
As a by-product of the eigenvector approach, the series estimator provides a useful tool for visualizing and organizing complex non-standard data. Figure 1 shows a two-dimensional embedding of the luminous red galaxy data (Section 4, main manuscript) using the first two basis functions ψ1(x) and ψ2(x) as coordinates; i.e., we consider the eigenmap x ↦ (ψ1(x), ψ2(x)). The eigenfunctions capture the structure of the data and vary smoothly with the response z (redshift). Similar data points are grouped together in the eigenmap; that is, samples with similar covariates are mapped to similar eigenfunction values (Coifman et al., 2005); see, for example, points A and B, or C and D in the figure. As a result, when f(z|x) is smooth as a function of x (Assumption 5 in Section 3 or Assumption 5' in Appendix A.5), the spectral series estimator yields good results. In addition, because distances in the eigenmap reflect similarity, the eigenvectors themselves are useful for detecting clusters and outliers in the data.
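As a minimal sketch of how such an eigenmap can be computed (not the authors' code; the toy data, bandwidth, and function name are illustrative), one can use the leading eigenvectors of a Gaussian kernel matrix as coordinates:

```python
import numpy as np

def spectral_embedding(X, eps, n_coords=2):
    """Embed the rows of X using the leading eigenvectors of a
    Gaussian kernel matrix as coordinates (psi_1, psi_2, ...)."""
    # Pairwise squared distances and Gaussian kernel weights.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4 * eps))
    # Symmetric eigendecomposition; eigh returns eigenvalues ascending.
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1]
    return vecs[:, order[:n_coords]]

# Toy data: two noisy clusters in 5 dimensions; nearby points are
# mapped to similar eigenvector coordinates.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 5)), rng.normal(1, 0.1, (50, 5))])
coords = spectral_embedding(X, eps=0.5)
print(coords.shape)  # (100, 2)
```

In the figure, the coordinates are the analogous eigenfunctions ψ1 and ψ2 evaluated on the galaxy sample.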
∗ Department of Statistics, Federal University of Sao Carlos, Brazil.
† Department of Statistics, Carnegie Mellon University, USA.
Figure 1: Top: Embedding of the luminous red galaxies from SDSS when using the first two eigenvectors of the Gaussian kernel operator as coordinates (ψ1 on the horizontal axis, ψ2 on the vertical axis), with points colored by smoothed redshift (roughly 0.26 to 0.38). Bottom: Covariates of 4 selected galaxies, A-D: fiber, model, Petrosian, and PSF magnitudes in the colors g−r, i−z, and r−i. The eigenfunctions capture the connectivity (similarity) structure of the data and vary smoothly with the response (redshift).
A.2 Details on the Galaxy Data
Due to the expansion of the Universe, the wavelength of a photon increases as it travels to us. The redshift of a galaxy is defined as

z = λobs/λemit − 1,

where λobs and λemit, respectively, are the wavelengths of a photon when observed by us and when emitted by the galaxy.
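As a quick numerical illustration of this definition (the wavelengths below are illustrative, with a rest wavelength roughly that of the H-alpha line):

```python
def redshift(lam_obs, lam_emit):
    """Redshift z = lambda_obs / lambda_emit - 1 (any common unit)."""
    return lam_obs / lam_emit - 1.0

# Illustrative example: a line emitted at 6563 Angstrom and observed
# at 8203.75 Angstrom corresponds to a redshift of 0.25.
print(redshift(8203.75, 6563.0))  # 0.25
```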
In spectroscopy, the flux of a galaxy, i.e., the number of photons emitted per unit area per unit time, is measured as a function of wavelength. The light is collected in bins of width ∼ 1 Å, where 1 Å = 10^{−10} m. Because transitions of electrons within atoms lead to spikes and troughs of width ∼ 1 Å in spectra, and because these transitions occur at precisely known wavelengths, one can determine the redshift of a galaxy with great precision via spectroscopy.
On the other hand, in photometry – a low-resolution but less costly alternative to spectroscopy – the photons are collected into a few wavelength bins (also named bands). When performing photometry, astronomers generally convert photon flux into a magnitude via a logarithmic transformation. Whereas fluxes depend on the details of the observing instrument, (well-calibrated) magnitudes from different instruments can be directly compared. Photometry for the Sloan Digital Sky Survey (SDSS) is carried out in five wavelength bands, denoted u, g, r, i, and z, that span the visible part of the electromagnetic spectrum. Each band is ∼ 1000 Å wide. The differences between contiguous magnitudes (or colors; e.g., g − r) are useful predictors for the redshift of the galaxy.
Our objective is to estimate the conditional density f(z|x), where x’s are the observed colors of a
galaxy. We train our model using redshifts obtained via spectroscopy and test our method on the
following sets of galaxy data:
Luminous Red Galaxies from SDSS. This data set consists of 3,000 luminous red galaxies (LRGs) from the SDSS.¹ We use information about 3 colors (g − r, r − i, and i − z) in psf, fiber, petrosian, and model magnitudes;² that is, there are 3 · 4 = 12 covariates. For more details about the data, see Freeman et al. (2009).
Galaxies from Multiple Surveys. This data set, used by Sheldon et al. (2012), includes information on model and cmodel magnitudes of the u, g, r, i, and z bands for galaxies from multiple surveys, including SDSS, DEEP2³ (Lin et al., 2004), and others. As in Sheldon et al. (2012), we base our estimates on the colors and the raw r-band magnitude. Hence, there are 2 · 4 + 2 = 10 covariates.
We use random subsets of different sizes of these data for the various experiments.
COSMOS. This data set consists of magnitudes for 752 galaxies that are measured in 42 different photometric bands from 7 magnitude systems (T. Dahlen 2013, private communication).⁴ For each system, we compute the differences between adjacent bands. This yields a total of d = 37 colors.
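The construction of colors from magnitudes can be sketched as follows (the system names, band counts, and magnitude values below are hypothetical, for illustration only):

```python
import numpy as np

def colors_from_magnitudes(mags_by_system):
    """For each magnitude system, take differences between adjacent
    bands; concatenating over systems gives the color covariates."""
    return np.concatenate([np.diff(m) for m in mags_by_system.values()])

# Hypothetical galaxy: two magnitude systems with 5 and 3 bands.
galaxy = {"system_a": np.array([22.1, 20.9, 20.0, 19.6, 19.4]),
          "system_b": np.array([21.5, 20.7, 20.3])}
x = colors_from_magnitudes(galaxy)
print(len(x))  # 4 + 2 = 6 colors
```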
A.3 Details on Simulated Data
The following schemes were used to generate data plotted in Figure 3:
Data on Manifold. Data is generated according to Z|x ∼ N(x, 0.5), where X = (X1, . . . , X20),
¹ http://www.sdss.org/
² Multiple estimators of magnitude exist that differ in algorithmic details.
³ http://deep.ps.uci.edu/
⁴ http://cosmos.astro.caltech.edu/
(x1, . . . , xp) lie on the square [0, 1]^p, and xp+1 = . . . = x20 = 0.
Few Relevant Covariates. Let Z|x ∼ N((1/p) ∑_{i=1}^p xi, 0.5), where X = (X1, . . . , Xd) ∼ N(0, Id). Here only the first p covariates influence the response (i.e., the conditional density is sparse), but there is no sparse (low-dimensional) structure in X.
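A sketch of the second scheme, taking 0.5 as the variance and illustrative choices p = 3 and d = 10 (the function name and defaults are ours, not the authors'):

```python
import numpy as np

def few_relevant_covariates(n, d=10, p=3, seed=0):
    """Z | x ~ N(mean(x_1..x_p), 0.5): only the first p of the d
    standard-normal covariates influence the response."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    # Taking 0.5 as the variance of the conditional distribution.
    Z = rng.normal(loc=X[:, :p].mean(axis=1), scale=np.sqrt(0.5))
    return X, Z

X, Z = few_relevant_covariates(n=500)
print(X.shape, Z.shape)  # (500, 10) (500,)
```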
A.4 Variations on the Kernel Operator
Several variants of the operator from Equation 1 in the main manuscript exist in the spectral
methods literature; see e.g. von Luxburg (2007) for normalization schemes common in spectral
clustering. In this section, we describe a non-symmetric variation of the kernel operator, referred to as the diffusion operator (e.g., Lee and Wasserman, 2010). Although the diffusion operator leads to empirical performance similar to that of the kernel operator of Equation 1, the former has better-understood theoretical properties, especially in the limit of the bandwidth ε → 0 (Coifman and Lafon, 2006, Belkin and Niyogi, 2008). This in turn allows us to analyze the dependence of the spectral series estimator on its tuning parameters, which ultimately leads to tighter bounds on the loss compared to the kernel PCA operator.
In this section, we let the kernel K be a local, radially symmetric function Kε(x, y) = g(d(x, y)/√ε), where Kε(x, y) is positive and bounded for all x, y ∈ X. We use the notation Kε to emphasize the dependence of K on the bandwidth. The first step is to renormalize the kernel according to

aε(x, y) = Kε(x, y) / pε(x),

where pε(x) = ∫ Kε(x, y) dP(y). We refer to aε(x, y) as the diffusion kernel (Meila and Shi, 2001).
As in Lee and Wasserman (2010), we define a "diffusion operator" Aε as

Aε(h)(x) = ∫_X aε(x, y) h(y) dP(y).    (1)
The operator Aε has a discrete set of non-negative eigenvalues λε,0 = 1 ≥ λε,1 ≥ . . . ≥ 0 with associated eigenfunctions (ψε,i)i. The eigenfunctions are orthogonal with respect to the weighted L2 inner product ⟨f, g⟩ε = ∫_X f(x) g(x) dSε(x), where

Sε(A) = ∫_A pε(x) dP(x) / ∫ pε(x) dP(x)

can be interpreted as a
smoothed version of P . See Lee and Wasserman (2010) for additional interpretation and details on
how to estimate these quantities.
The construction of both the tensor basis {Ψi,j}i,j and the conditional density estimator proceeds just as in the case when using the operator from Equation 1. The only difference is that, as {Ψi,j}i,j are now orthonormal with respect to λ × Sε instead of λ × P, the coefficients of the projection are given by

βi,j = ∫∫ f(z|x) Ψi,j(z, x) dSε(x) dz = ∫∫ f(z|x) Ψi,j(z, x) sε(x) dP(x) dz
     = ∫∫ Ψi,j(z, x) sε(x) dP(x, z) = E[Ψi,j(Z, X) sε(X)].
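On a finite sample, the renormalization above amounts to dividing each row of the kernel matrix by its row sum, making the matrix row-stochastic; a minimal sketch, with an illustrative Gaussian kernel and bandwidth (not the authors' code):

```python
import numpy as np

def diffusion_kernel(X, eps):
    """Row-normalize a Gaussian kernel matrix:
    a_eps(x_i, x_j) = K_eps(x_i, x_j) / p_eps(x_i)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (4 * eps))
    p = K.sum(axis=1)              # empirical p_eps(x_i), up to a 1/n factor
    return K / p[:, None]

rng = np.random.default_rng(1)
A = diffusion_kernel(rng.normal(size=(30, 2)), eps=0.5)
print(np.allclose(A.sum(axis=1), 1.0))  # True: each row is a probability vector
```

Being row-stochastic, the matrix has largest eigenvalue 1, mirroring λε,0 = 1 for the operator Aε.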
A.5 Bounds for Spectral Series Estimator with Kernel PCA Basis
Smoothness in x can be defined in many ways. The standard approach in the theoretical Support Vector Machine (SVM) literature (Steinwart and Christmann, 2008) is to consider the norm of a fixed Reproducing Kernel Hilbert Space (RKHS). Here we derive rates for a spectral series conditional density estimator with a fixed kernel, using the operator of Equation 1 in the main manuscript. As before, we assume:
Assumption 1. ∫ f²(z|x) dP(x) dz < ∞.

Assumption 2. Mφ := sup_z sup_i φi(z) < ∞.

Assumption 3. λ1 > λ2 > . . . > λJ > 0.
Assumption 4. (Smoothness in z direction) ∀x ∈ X, let

hx : [0, 1] → ℝ,   hx(z) = f(z|x).

Note that hx(z) is simply f(z|x) viewed as a function of z for a fixed x. We assume

hx ∈ Wφ(sx, cx),

where sx and cx are such that β := inf_x sx > 1/2 and ∫_X c²x dP(x) < ∞.
Let HK be the Reproducing Kernel Hilbert Space (RKHS) associated to the kernel K. Instead of Assumption 5, we here assume

Assumption 5'. (Smoothness in x direction) ∀z ∈ [0, 1], f(z|x) ∈ {g ∈ HK : ||g||²_{HK} ≤ c²z}, where f(z|x) is viewed as a function of x, and the cz values are such that cK := ∫_{[0,1]} c²z dz < ∞.

In other words, we enforce smoothness in the x direction by requiring f(z|x) to be in an RKHS for all z. Smaller values of cK indicate smoother functions. The reader is referred to, e.g., Minh et al. (2006) for an account of measuring smoothness through norms in RKHSs.
Let n be the sample size of the data used to estimate the coefficients βi,j, and let m be the sample size of the data used to estimate the basis functions. In the next section, we prove the following:

Theorem 1. Let f̂I,J(z|x) be the spectral series estimator with cutoffs I and J, based on the eigenfunctions of the operator in Equation (1). Under Assumptions 1-4 and 5', the loss satisfies

L(f̂I,J, f) = IJ × [OP(1/n) + OP(1/(λJ ΔJ² m))] + cK O(λJ) + O(1/I^{2β}),    (2)

where ΔJ = min_{1≤j≤J} |λj − λj+1|.
Some interpretation: The first term of the bound of Theorem 1 corresponds to the sampling error of the estimator. The second and third terms correspond to the approximation error. Note that in d + 1 dimensions, the variance of traditional orthogonal series methods is ∏_{i=1}^{d+1} Ii × OP(1/n), where Ii is the number of components used for the i:th variable (Efromovich, 1999). Interestingly, the term IJ × OP(1/n) for our estimator is the variance term of a traditional series estimator with one tensor product only; i.e., this is the variance for the traditional case with only one explanatory variable. The cost of estimating the basis adds the term IJ × OP((λJ ΔJ² m)^{−1}) to our rate.
To provide some insight regarding the effect of estimating the basis on the final estimator, we work out the details of two examples where the eigenvalues follow a polynomial decay λJ ≍ J^{−2α} for some α > 1/2. See Ji et al. (2012) for some empirical motivation for polynomial decay rates, and Steinwart et al. (2009) for theory and examples. The constant α is typically related to the dimensionality of the data.
Example 1 (Supervised Learning). For a power-law eigenvalue decay, the eigengap λJ − λJ+1 = O(J^{−2α−1}). If we use the same data for estimating the basis and the coefficients of the expansion, i.e., if n = m, then assuming a fixed kernel K, the optimal cutoffs for the bound from Theorem 1 are I ≍ n^{α/(8αβ+α+3β)} and J ≍ n^{β/(8αβ+α+3β)}, yielding the rate OP(n^{−2αβ/(8αβ+α+3β)}).
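These cutoffs can be checked symbolically: writing I ≍ n^i and J ≍ n^j, the n-exponents of the variance term I J^{6α+3}/n (from IJ/(λJ ΔJ² n) with λJ ΔJ² ≍ J^{−6α−2}) and of the two bias terms λJ ≍ J^{−2α} and I^{−2β} all equal the claimed −2αβ/(8αβ + α + 3β) at the stated choices. A sketch of this check using sympy:

```python
import sympy as sp

a, b = sp.symbols("alpha beta", positive=True)  # alpha, beta
D = 8 * a * b + a + 3 * b                       # common denominator
i, j = a / D, b / D                             # I ~ n^i, J ~ n^j

# Exponent of n in each term of the Theorem 1 bound when n = m:
var_basis = i + (6 * a + 3) * j - 1   # I*J/(lambda_J Delta_J^2 n) ~ I J^{6a+3}/n
bias_x = -2 * a * j                   # lambda_J ~ J^{-2 alpha}
bias_z = -2 * b * i                   # I^{-2 beta}
rate = -2 * a * b / D                 # claimed overall rate exponent

print([sp.simplify(t - rate) for t in (var_basis, bias_x, bias_z)])  # [0, 0, 0]
```

The remaining variance term IJ/n has exponent (α + β − D)/D, which is strictly smaller, so it does not drive the rate.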
Example 2 (Semi-Supervised Learning). Suppose that we have additional unlabeled data which can be used to learn the structure of the data distribution. In the limit of infinite unlabeled data, m → ∞, the optimal cutoffs are I ≍ n^{α/(2αβ+α+β)} and J ≍ n^{β/(2αβ+α+β)}, yielding the rate OP(n^{−2αβ/(2αβ+α+β)}).
By comparing the rates from Examples 1 and 2, we see that estimating the basis (ψj)j decreases the rate of convergence. On the other hand, the possibility of using unlabeled data alleviates the problem. As an illustration, let the RKHS H in Assumption 5' be the isotropic Sobolev space with smoothness s > d and β = s; i.e., assume that f(z|x) belongs to a Sobolev space with the same smoothness in both directions. It is known that under certain conditions on the domain X, the eigenvalues satisfy λJ ≍ J^{−2s/d} (see, e.g., Steinwart et al., 2009, Koltchinskii and Yuan, 2010). In this setting, the rate when m = n is OP(n^{−2s/(8s+1+3d)}). Notice that similar rates (also not minimax optimal) are obtained when learning regression functions via RKHSs; see, e.g., Ye and Zhou (2008) and Steinwart et al. (2009). In the limit of infinite unlabeled data, however, we achieve the rate OP(n^{−2s/(2s+1+d)}), which is the standard minimax rate for estimating functions in d + 1 dimensions (recall that the conditional density is defined in d + 1 dimensions) (Stone, 1982, Hoffmann and Lepski, 2002).
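The rates quoted here follow from the exponents of Examples 1 and 2 by substituting α = s/d and β = s:

```latex
\frac{2\alpha\beta}{8\alpha\beta+\alpha+3\beta}
  = \frac{2(s/d)\,s}{8(s/d)\,s + s/d + 3s}
  = \frac{2s^2}{8s^2 + s + 3sd}
  = \frac{2s}{8s + 1 + 3d},
\qquad
\frac{2\alpha\beta}{2\alpha\beta+\alpha+\beta}
  = \frac{2s^2}{2s^2 + s + sd}
  = \frac{2s}{2s + 1 + d}.
```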
A.5.1 Proofs
To simplify the proofs, we assume the functions ψ1, . . . , ψJ are estimated using an unlabeled sample
X1, . . . , Xm, drawn independently from the sample used to estimate the coefficients βi,j . Without
loss of generality, this can be achieved by splitting the labeled sample in two. This split is only
for theoretical purposes; in practice using all data to estimate the basis leads to better results.
The technique also allows us to derive bounds for the semi-supervised learning setting described in
the paper, and better understand the additional cost of estimating the basis. Define the following
quantities:
fI,J(z|x) = ∑_{i=1}^I ∑_{j=1}^J βi,j φi(z) ψj(x),    βi,j = ∫∫ φi(z) ψj(x) f(z, x) dx dz,

f̂I,J(z|x) = ∑_{i=1}^I ∑_{j=1}^J β̂i,j φi(z) ψ̂j(x),    β̂i,j = (1/n) ∑_{k=1}^n φi(zk) ψ̂j(xk).
Note that

∫∫ (f̂I,J(z|x) − f(z|x))² dP(x) dz = ∫∫ (f̂I,J(z|x) − fI,J(z|x) + fI,J(z|x) − f(z|x))² dP(x) dz
  ≤ 2 (VAR(f̂I,J, fI,J) + B(fI,J, f)),    (3)

where B(fI,J, f) := ∫∫ (fI,J(z|x) − f(z|x))² dP(x) dz can be interpreted as a bias term (or approximation error), and VAR(f̂I,J, fI,J) := ∫∫ (f̂I,J(z|x) − fI,J(z|x))² dP(x) dz can be interpreted as a variance term. First we bound the variance.
Lemma 1. ∀1 ≤ j ≤ J,

∫ (ψ̂j(x) − ψj(x))² dP(x) = OP(1/(λj δj² m)),

where δj = λj − λj+1.
For a proof of Lemma 1 see for example Sinha and Belkin (2009).
Lemma 2. ∀1 ≤ j ≤ J, there exists C < ∞ that does not depend on m such that

E[(ψ̂j(X) − ψj(X))²] < C,

where X ∼ P(x) is independent of the sample used to construct ψ̂j.
Proof. Let δ ∈ (0, 1). From Sinha and Belkin (2009), it follows that

P(∫ (ψ̂j(x) − ψj(x))² dP(x) > 16 log(2/δ)/(δj² m)) < δ,

and therefore ∀ε > 0,

P(∫ (ψ̂j(x) − ψj(x))² dP(x) > ε) < 2 e^{−δj² m ε/16}.

Hence

E[(ψ̂j(X) − ψj(X))²] = E[∫ (ψ̂j(x) − ψj(x))² dP(x)] = ∫_0^∞ P(∫ (ψ̂j(x) − ψj(x))² dP(x) > ε) dε ≤ ∫ 2 e^{−δj² m ε/16} dε ≤ ∫ 2 e^{−δj² ε/16} dε < ∞.
Lemma 3. ∀1 ≤ i ≤ I and ∀1 ≤ j ≤ J, there exists C < ∞ that does not depend on m such that

E[V[φi(Z)(ψ̂j(X) − ψj(X)) | X1, . . . , Xm]] < C.

Proof. Using that φ is bounded (Assumption 2), it follows that

E[V[φi(Z)(ψ̂j(X) − ψj(X)) | X1, . . . , Xm]] ≤ V[φi(Z)(ψ̂j(X) − ψj(X))] ≤ E[φi²(Z)(ψ̂j(X) − ψj(X))²] ≤ K E[(ψ̂j(X) − ψj(X))²]

for some K < ∞. The result follows from Lemma 2.
Lemma 4. ∀1 ≤ i ≤ I and ∀1 ≤ j ≤ J,

[(1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − ∫∫ φi(z)(ψ̂j(x) − ψj(x)) dP(z, x)]² = OP(1/n).
Proof. Let A = ∫∫ φi(z)(ψ̂j(x) − ψj(x)) dP(z, x). By Chebyshev's inequality, it holds that ∀M > 0,

P(|(1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − A|² > M | X1, . . . , Xm) ≤ (1/(nM)) V[φi(Z)(ψ̂j(X) − ψj(X)) | X1, . . . , Xm].

The conclusion follows from taking an expectation with respect to the unlabeled samples on both sides and using Lemma 3.

Note that the ψ̂'s are random functions, and therefore the proof of Lemma 4 relies on the fact that these functions are estimated using a different sample than X1, . . . , Xn.
Lemma 5. ∀1 ≤ i ≤ I and ∀1 ≤ j ≤ J,

(β̂i,j − βi,j)² = OP(1/n) + OP(1/(λj δj² m)).
Proof. It holds that

(1/2)(β̂i,j − βi,j)² ≤ ((1/n) ∑_{k=1}^n φi(Zk) ψj(Xk) − βi,j)² + ((1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)))².

The first term is OP(1/n). Let A be as in the proof of Lemma 4. By using the Cauchy-Schwarz inequality and Lemma 4, the second term divided by two is bounded by

(1/2)((1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − A + A)² ≤ ((1/n) ∑_{k=1}^n φi(Zk)(ψ̂j(Xk) − ψj(Xk)) − A)² + A²
  ≤ OP(1/n) + (∫∫ φi²(z) dP(z, x)) (∫∫ (ψ̂j(x) − ψj(x))² dP(z, x)).

The result follows from Lemma 1 and the orthogonality of φi.
Lemma 6. [Sinha and Belkin 2009, Corollary 1] Under the stated assumptions,

∫ ψ̂j²(x) dP(x) = OP(1/(λj ΔJ² m)) + 1

and

∫ ψ̂i(x) ψ̂j(x) dP(x) = OP((1/√λi + 1/√λj) · 1/(ΔJ √m)),

where ΔJ = min_{1≤j≤J} δj.
Lemma 7. Let h(z|x) = ∑_{i=1}^I ∑_{j=1}^J βi,j φi(z) ψ̂j(x). Then

∫∫ |f̂I,J(z|x) − h(z|x)|² dP(x) dz = IJ (OP(1/n) + OP(1/(λJ ΔJ² m))).
Proof.

∫∫ |f̂I,J(z|x) − h(z|x)|² dP(x) dz
  = ∑_{i=1}^I ∑_{j=1}^J ∑_{l=1}^J (β̂i,j − βi,j)(β̂i,l − βi,l) ∫ ψ̂j(x) ψ̂l(x) dP(x)
  ≤ ∑_{i=1}^I ∑_{j=1}^J (β̂i,j − βi,j)² ∫ ψ̂j²(x) dP(x) + ∑_{i=1}^I ∑_{j=1}^J ∑_{l=1, l≠j}^J (β̂i,j − βi,j)(β̂i,l − βi,l) ∫ ψ̂j(x) ψ̂l(x) dP(x)
  ≤ ∑_{i=1}^I ∑_{j=1}^J (β̂i,j − βi,j)² ∫ ψ̂j²(x) dP(x) + ∑_{i=1}^I ∑_{j=1}^J (β̂i,j − βi,j)² √(∑_{j=1}^J ∑_{l=1, l≠j}^J (∫ ψ̂j(x) ψ̂l(x) dP(x))²),

where the last inequality follows from repeatedly using Cauchy-Schwarz. The result follows from Lemmas 5 and 6.
Lemma 8. Let h(z|x) be as in Lemma 7. Then

∫∫ |h(z|x) − fI,J(z|x)|² dP(x) dz = J · OP(1/(λJ ΔJ² m)).
Proof. Using the Cauchy-Schwarz inequality,

∫∫ |h(z|x) − fI,J(z|x)|² dP(x) dz = ∫∫ |∑_{i=1}^I ∑_{j=1}^J βi,j φi(z)(ψ̂j(x) − ψj(x))|² dP(x) dz
  ≤ (∑_{j=1}^J ∫ [∑_{i=1}^I βi,j φi(z)]² dz) (∑_{j=1}^J ∫ [ψ̂j(x) − ψj(x)]² dP(x))
  = (∑_{j=1}^J ∑_{i=1}^I β²i,j) (∑_{j=1}^J ∫ [ψ̂j(x) − ψj(x)]² dP(x)).

The conclusion follows from Lemma 1 and by noticing that ∑_{j=1}^J ∑_{i=1}^I β²i,j ≤ ||f(z|x)||² < ∞.
Using the results above, we can now bound the variance term:

Theorem 2. Under the stated assumptions,

VAR(f̂I,J, fI,J) = IJ (OP(1/n) + OP(1/(λJ ΔJ² m))).

Proof. Let h be defined as in Lemma 7. We have

(1/2) VAR(f̂I,J, fI,J) = (1/2) ∫∫ |f̂I,J(z|x) − h(z|x) + h(z|x) − fI,J(z|x)|² dP(x) dz
  ≤ ∫∫ |f̂I,J(z|x) − h(z|x)|² dP(x) dz + ∫∫ |h(z|x) − fI,J(z|x)|² dP(x) dz.

The conclusion follows from Lemmas 7 and 8.
We next bound the bias term.

Lemma 9. For each z ∈ [0, 1], let gz(x) = f(z|x), viewed as a function of x, and expand gz in the basis ψ: gz(x) = ∑_{j≥1} α_j^z ψj(x), where α_j^z = ∫ gz(x) ψj(x) dP(x). We have

α_j^z = ∑_{i≥1} βi,j φi(z)   and   ∫ (α_j^z)² dz = ∑_{i≥1} β²i,j.

Proof. The result follows from projecting α_j^z onto the basis φ.
Similarly, we have the following:

Lemma 10. For each x ∈ X, expand hx(z) in the basis φ: hx(z) = ∑_{i≥1} α_i^x φi(z), where α_i^x = ∫ hx(z) φi(z) dz. We have

α_i^x = ∑_{j≥1} βi,j ψj(x)   and   ∫ (α_i^x)² dP(x) = ∑_{j≥1} β²i,j.
Lemma 11. Using the same notation as Lemmas 9 and 10, we have

βi,j = ∫ α_i^x ψj(x) dP(x) = ∫ α_j^z φi(z) dz.

Proof. The result follows from plugging the definitions of α_i^x and α_j^z into the expressions above and recalling the definition of βi,j.
Lemma 12. ∑_{i≥I} ∫ (α_i^x)² dP(x) = O(1/I^{2β}).

Proof. By Lemma 10, hx(z) = ∑_{i≥1} α_i^x φi(z). As, by Assumption 4, hx ∈ Wφ(sx, cx),

∑_{i≥I} I^{2sx} (α_i^x)² ≤ ∑_{i≥I} i^{2sx} (α_i^x)² ≤ c²x.

Hence

∑_{i≥I} ∫ (α_i^x)² dP(x) ≤ ∫ (c²x / I^{2sx}) dP(x) ≤ (1/I^{2β}) ∫_X c²x dP(x) < ∞.
Lemma 13. ∑_{j≥J} ∫ (α_j^z)² dz = cK O(λJ).

Proof. Note that ||gz(·)||²_{HK} = ∑_{j≥1} (α_j^z)²/λj (Minh, 2010). Using Assumption 5' and the fact that the eigenvalues are decreasing, it follows that

∑_{j≥J} (α_j^z)² = ∑_{j≥J} (α_j^z)² (λj/λj) ≤ λJ ∑_{j≥J} (α_j^z)²/λj ≤ λJ ||gz(·)||²_{HK} ≤ λJ c²z,

and therefore

∑_{j≥J} ∫ (α_j^z)² dz ≤ λJ ∫ c²z dz = cK O(λJ).
Theorem 3. Under the stated assumptions, the bias is bounded by

B(fI,J, f) = cK O(λJ) + O(1/I^{2β}).

Proof. By using orthogonality, we have that

B(fI,J, f) := ∫∫ (f(z|x) − fI,J(z|x))² dP(x) dz ≤ ∑_{j>J} ∑_{i≥1} β²i,j + ∑_{i>I} ∑_{j≥1} β²i,j ≤ ∑_{j≥J} ∫ (α_j^z)² dz + ∑_{i≥I} ∫ (α_i^x)² dP(x),

where the last step follows from Lemmas 9 and 10. The theorem follows from Lemmas 12 and 13.

By putting together Theorems 2 and 3 according to the bias-variance decomposition of Equation 3, we arrive at Theorem 1 in the appendix.
A.6 Proofs for the Spectral Series Estimator with Diffusion Basis
The proof of Theorem 1 in the main manuscript follows the same principles as the derivations in the last section. The main differences are:

1. The orthogonality of the diffusion basis is defined with respect to the stationary measure S instead of P;

2. The bounds on the eigenfunctions of Lemma 1 have to be adapted for the diffusion basis;

3. The bias term should take into account the new smoothness assumption in the x direction.

Below we present the results that handle these differences. We assume n = m unless otherwise stated, and we make the following additional regularity assumptions:

(RC1) P has compact support X and bounded density 0 < a ≤ p(x) ≤ b < ∞, ∀x ∈ X.

(RC2) The weights are positive and bounded; that is, ∀x, y ∈ X, 0 < m ≤ kε(x, y) ≤ M, where m and M are constants that do not depend on ε.

(RC3) ∀0 ≤ j ≤ J and X ∼ P, there exists some constant C < ∞ (not depending on n) such that

E[|ϕε,j(X) − ϕ̂ε,j(X)|²] < C,

where ϕε,j(x) = ψε,j(x) sε(x) and ϕ̂ε,j(x) = ψ̂ε,j(x) sε(x).
In the proofs that follow, we handle the first difference by the following lemma:

Lemma 14. ∀x ∈ X,

a/b ≤ sε(x) ≤ b/a.

Proof. ∀x ∈ X,

inf_{x∈X} pε(x) / sup_{x∈X} pε(x) ≤ sε(x) ≤ sup_{x∈X} pε(x) / inf_{x∈X} pε(x),

where a ∫_X kε(x, y) dy ≤ pε(x) ≤ b ∫_X kε(x, y) dy.

Now, let

Gε = (Aε − I)/ε,    (4)

where I is the identity operator. The operator Gε has the same eigenvectors ψε,j as the diffusion operator Aε. Its eigenvalues are given by −ν²ε,j = (λε,j − 1)/ε, where the λε,j are the eigenvalues of Aε. Define the functional

Jε(f) = −⟨Gε f, f⟩ε,    (5)

which maps a function f ∈ L2(X, P) to a non-negative real number. For small ε, this functional measures the variability of the function f with respect to the distribution P.
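A small numerical sketch of this interpretation (empirical operator, illustrative sample and bandwidth, not the authors' code): on a uniform sample from [0, 1], the empirical version of −⟨Gεf, f⟩ is far larger for a rapidly oscillating f than for a smooth one.

```python
import numpy as np

def J_eps(x, f_vals, eps):
    """Empirical sketch of J_eps(f) = -<G_eps f, f>_eps, with
    G_eps = (A_eps - I)/eps and A_eps the row-normalized kernel."""
    sq = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-sq / (4 * eps))
    A = K / K.sum(axis=1, keepdims=True)
    G = (A - np.eye(len(x))) / eps
    # The inner product is approximated by a simple sample average.
    return -np.mean((G @ f_vals) * f_vals)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(size=300))
smooth, wiggly = x, np.sin(40 * x)
print(J_eps(x, smooth, 1e-3) < J_eps(x, wiggly, 1e-3))  # True
```

Consistently with Lemma 16 below, the value for the oscillating function is on the order of its much larger Dirichlet energy ∫ ‖∇f‖².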
The following result bounds the approximation error for an orthogonal series expansion of a given function f.

Proposition 1. For f ∈ L2(X, P),

∫_X |f(x) − fε,J(x)|² dP(x) ≤ O(Jε(f)/ν²ε,J+1),    (6)

where −ν²ε,J+1 is the (J + 1)th eigenvalue of Gε, and fε,J is the projection of f onto the first J eigenfunctions.

Proof. Note that Jε(f) = ∑_j ν²ε,j |βε,j|². Hence,

Jε(f)/ν²ε,J+1 = ∑_j (ν²ε,j/ν²ε,J+1) |βε,j|² ≥ ∑_{j>J} (ν²ε,j/ν²ε,J+1) |βε,j|² ≥ ∑_{j>J} |βε,j|² = ∫_X |f(x) − fε,J(x)|² dSε(x).

The result follows from Lemma 14.

The total bias bound, O(Jε(f)/ν²ε,J+1) + O(I^{−2β}), is derived in the same fashion as the bound from Theorem 3: we combine the bound on the bias in the x direction (derived from Proposition 1) with the bias in the z direction (derived in Lemma 12). The following two additional results will be used to derive the bounds in the case ε → 0.
Denote the quantities derived from the bias-corrected kernel k*ε by A*ε, G*ε, J*ε, etc. In the limit ε → 0, we have the following result:

Lemma 15. (Coifman and Lafon, 2006; Proposition 3) For f ∈ C³(X) and x ∈ X \ ∂X,

− lim_{ε→0} G*ε = Δ.

If X is a compact C∞ submanifold of R^d, then Δ is the (positive semi-definite) Laplace-Beltrami operator of X, defined by

Δf(x) = −∑_{j=1}^r (∂²f/∂sj²)(x),

where (s1, . . . , sr) are the normal coordinates of the tangent plane at x.

Lemma 16. For functions f ∈ C³(X) whose gradients vanish at the boundary,

lim_{ε→0} J*ε(f) = ∫_X ‖∇f(x)‖² dS(x).
Proof. By Green's first identity,

∫_X f ∇²f dS(x) + ∫_X ∇f · ∇f dS(x) = ∮_{∂X} f (∇f · n) dS(x) = 0,

where n is the normal direction to the boundary ∂X, and the last surface integral vanishes due to the Neumann boundary condition. It follows from Lemma 15 that

lim_{ε→0} J*ε(f) = − lim_{ε→0} ∫_X f(x) G*ε f(x) dSε(x) = ∫_X f(x) Δf(x) dS(x) = ∫_X ‖∇f(x)‖² dS(x).
Let Âε denote the sample version of the integral operator Aε. To bound the difference ψε,j − ψ̂ε,j, we follow the strategy from Rosasco et al. (2010) and introduce two new integral operators that are related to Aε and Âε, but that both act on an auxiliary⁵ RKHS H of smooth functions. Define AH, ÂH : H → H, where

AH f(x) = ∫ kε(x, y)⟨f, K(·, y)⟩H dP(y) / ∫ kε(x, y) dP(y) = ∫ aε(x, y)⟨f, K(·, y)⟩H dP(y),

ÂH f(x) = ∑_{i=1}^n kε(x, Xi)⟨f, K(·, Xi)⟩H / ∑_{i=1}^n kε(x, Xi) = ∫ âε(x, y)⟨f, K(·, y)⟩H dPn(y),

and K is the reproducing kernel of H. Define the operator norm ‖A‖H = sup_{f∈H} ‖Af‖H/‖f‖H, where ‖f‖²H = ⟨f, f⟩H. Now suppose the weight function kε is sufficiently smooth with respect to H (Assumption 1 in Rosasco et al. 2010); this condition is, for example, satisfied by a Gaussian kernel on a compact support X. By Propositions 13.3 and 14.3 in Rosasco et al. (2010), we can then relate the functions ψε,j and ψ̂ε,j, respectively, to the eigenfunctions uε,j and ûε,j of AH and ÂH. We have that

‖ψε,j − ψ̂ε,j‖_{L2(X,P)} = C1 ‖uε,j − ûε,j‖_{L2(X,P)} ≤ C2 ‖uε,j − ûε,j‖H    (7)

⁵ This auxiliary space only enters the intermediate derivations and plays no role in the error analysis of the algorithm itself.
for some constants C1 and C2. According to Theorem 6 in Rosasco et al. (2008) for eigenprojections of positive compact operators, it holds that

‖uε,j − ûε,j‖H ≤ ‖AH − ÂH‖H / δε,j,    (8)

where δε,j is proportional to the eigengap λε,j − λε,j+1. As a result, we can bound the difference ‖ψε,j − ψ̂ε,j‖_{L2(X,P)} by controlling the deviation ‖AH − ÂH‖H.
We choose the auxiliary RKHS H to be a Sobolev space with a sufficiently high degree of smoothness so that certain assumptions ((RC4)-(RC5) below) are fulfilled. Let Hs denote the Sobolev space of order s with vanishing gradients at the boundary; that is, let

Hs = {f ∈ L2(X) | Dαf ∈ L2(X) ∀|α| ≤ s, Dαf|∂X = 0 ∀|α| = 1},

where Dαf is the weak partial derivative of f with respect to the multi-index α, and L2(X) is the space of square integrable functions with respect to the Lebesgue measure. Let C³b(X) be the set of uniformly bounded, three times differentiable functions with uniformly bounded derivatives whose gradients vanish at the boundary. Now suppose that H ⊂ C³b(X) and that

(RC4) ∀f ∈ H, |α| = s, Dα(AH f − ÂH f) = AH Dαf − ÂH Dαf;

(RC5) ∀f ∈ H, |α| = s, Dαf ∈ C³b(X).
Lemma 17. Let εn → 0 and n εn^{d/2} / log(1/εn) → ∞. Then, under the stated regularity conditions, ‖AH − ÂH‖H = OP(γn), where

γn = √( log(1/εn) / (n εn^{d/2}) ).
Proof. Uniformly for all f ∈ C³b(X) and all x in the support of P,

|Aεf(x) − Âεf(x)| ≤ |Aεf(x) − Ãεf(x)| + |Ãεf(x) − Âεf(x)|,

where Ãεf(x) = ∫ ãε(x, y) f(y) dP(y) and ãε(x, y) = kε(x, y)/p̂ε(x). From Gine and Guillou (2002),

sup_x |pε(x) − p̂ε(x)| / |pε(x) p̂ε(x)| = OP(γn).

Hence,

|Aεf(x) − Ãεf(x)| ≤ (|pε(x) − p̂ε(x)| / |pε(x) p̂ε(x)|) ∫ |f(y)| kε(x, y) dP(y) = OP(γn) ∫ |f(y)| kε(x, y) dP(y) = OP(γn).

Next, we bound Ãεf(x) − Âεf(x). We have

Ãεf(x) − Âεf(x) = ∫ f(y) ãε(x, y) (dPn(y) − dP(y)) = (1/(p(x) + oP(1))) ∫ f(y) kε(x, y) (dPn(y) − dP(y)).

Now, expand f(y) = f(x) + rn(y), where rn(y) = (y − x)ᵀ∇f(uy) and uy is between y and x. So,

∫ f(y) kε(x, y)(dPn(y) − dP(y)) = f(x) ∫ kε(x, y)(dPn(y) − dP(y)) + ∫ rn(y) kε(x, y)(dPn(y) − dP(y)).

By an application of Talagrand's inequality to each term, as in Theorem 5.1 of Gine and Koltchinskii (2006), we have

∫ f(y) kε(x, y)(dPn(y) − dP(y)) = OP(γn).

Thus, sup_{f∈C³b(X)} ‖Aεf − Âεf‖∞ = OP(γn).
The Sobolev space H is a Hilbert space with respect to the scalar product

⟨f, g⟩H = ⟨f, g⟩_{L2(X)} + ∑_{|α|=s} ⟨Dαf, Dαg⟩_{L2(X)}.

Under regularity conditions (RC4)-(RC5),

sup_{f∈H: ‖f‖H=1} ‖Aεf − Âεf‖²H ≤ sup_{f∈H} ∑_{|α|≤s} ‖Dα(Aεf − Âεf)‖²_{L2(X)} = ∑_{|α|≤s} sup_{f∈H} ‖AεDαf − ÂεDαf‖²_{L2(X)}
  ≤ ∑_{|α|≤s} sup_{f∈C³b(X)} ‖Aεf − Âεf‖²_{L2(X)} ≤ C sup_{f∈C³b(X)} ‖Aεf − Âεf‖²∞

for some constant C. Hence,

sup_{f∈H} ‖Aεf − Âεf‖H/‖f‖H = sup_{f∈H, ‖f‖H=1} ‖Aεf − Âεf‖H ≤ C′ sup_{f∈C³b(X)} ‖Aεf − Âεf‖∞ = OP(γn). □
For εn → 0 and n εn^{d/2}/log(1/εn) → ∞, it then holds that:

Proposition 2. ∀0 ≤ j ≤ J,

‖ψε,j − ψ̂ε,j‖_{L2(X,P)} = OP(γn/δε,j),

where δε,j = λε,j − λε,j+1.

Proof. From Lemma 14 and Equation 8, we have that

‖ψε,j − ψ̂ε,j‖_{L2(X,P)} ≤ √(b/a) ‖ψε,j − ψ̂ε,j‖ε ≤ C ‖AH − ÂH‖H / (λε,j − λε,j+1)

for some constant C that does not depend on n. The result follows from Lemma 17.
By putting together the bounds on bias and variance, we arrive at our main result.

Theorem 4. Under the conditions of Theorem 2 of the paper, if εn → 0 and n εn^{d/2}/log(1/εn) → ∞, a bound on the loss of the conditional density estimator with diffusion basis is given by

L(f̂, f) = O(Jε(f)/ν²ε,J+1) + O(1/I^{2β}) + IJ · OP(1/n) + IJ · OP(γn²/(εn ΔJ²)),

where J(f) = ∫_X ‖∇f(x)‖² dS(x), ΔJ = min_{0≤j≤J}(ν²j+1 − ν²j), and ν²J+1 is the (J + 1)th eigenvalue of Δ.
A Taylor expansion yields the following:

Corollary 1. Under the conditions of Theorem 2 of the paper, if εn → 0 and n εn^{d/2}/log(1/εn) → ∞, a bound on the loss of the conditional density estimator with diffusion basis is given by

L(f̂, f) = (J(f) O(1) + O(εn)) / ν²J+1 + O(1/I^{2β}) + IJ · OP(1/n) + IJ · OP(γn²/(εn ΔJ²)).
Corollary 2. By taking ε = n^{−2/(d+4)} and ignoring lower-order terms,

L(f̂, f) = O(J(f)/ν²J+1) + O(1/I^{2β}) + (IJ/ΔJ²) OP((log n / n)^{2/(d+4)}).    (9)

If the support of the data is on a manifold with intrinsic dimension p, the eigenvalues of Δ satisfy ν²j ∼ j^{2/p} (Zhou and Srebro, 2011). Theorem 1 in the main manuscript follows.
References
Belkin, M. and P. Niyogi (2008). Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences 74(8), 1289–1308.
Coifman, R. and S. Lafon (2006). Diffusion maps. Applied and Computational Harmonic Analysis 21, 5–30.
Coifman, R. R., S. Lafon, A. B. Lee, et al. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America 102(21), 7426–7431.
Efromovich, S. (1999). Nonparametric Curve Estimation: Methods, Theory and Applications. Springer Series in Statistics. Springer.
Freeman, P. E., J. A. Newman, A. B. Lee, J. W. Richards, and C. M. Schafer (2009). Photometric redshift estimation using Spectral Connectivity Analysis. Monthly Notices of the Royal Astronomical Society.
Gine, E. and A. Guillou (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l'Institut Henri Poincare 38, 907–921.
Hoffmann, M. and O. Lepski (2002). Random rates in anisotropic regression. The Annals of Statistics 30, 325–358.
Ji, M., T. Yang, B. Lin, R. Jin, and J. Han (2012). A simple algorithm for semi-supervised learning with improved generalization error bound. In Proceedings of the 29th International Conference on Machine Learning.
Koltchinskii, V. and M. Yuan (2010). Sparsity in multiple kernel learning. The Annals of Statistics 38(6), 3660–3695.
Lee, A. B. and L. Wasserman (2010). Spectral Connectivity Analysis. Journal of the American Statistical Association 105(491), 1241–1255.
Lin, L., D. C. Koo, C. N. Willmer, et al. (2004). The DEEP2 galaxy redshift survey: Evolution of close galaxy pairs and major-merger rates up to z ∼ 1.2. The Astrophysical Journal Letters 617(1), 9–12.
Meila, M. and J. Shi (2001). A random walks view of spectral segmentation. In Proc. Eighth International Conference on Artificial Intelligence and Statistics.
Minh, H. Q. (2010). Some properties of Gaussian Reproducing Kernel Hilbert Spaces and their implications for function approximation and learning theory. Constructive Approximation 32(2), 307–338.
Minh, H. Q., P. Niyogi, and Y. Yao (2006). Mercer's theorem, feature maps, and smoothing. In Learning Theory, 19th Annual Conference on Learning Theory.
Rosasco, L., M. Belkin, and E. D. Vito (2008). A note on perturbation results for learning empirical operators. CSAIL Technical Report TR-2008-052, CBCL-274, Massachusetts Institute of Technology.
Rosasco, L., M. Belkin, and E. D. Vito (2010). On learning with integral operators. Journal of Machine Learning Research 11, 905–934.
Sheldon, E., C. Cunha, R. Mandelbaum, J. Brinkmann, and B. Weaver (2012). Photometric redshift probability distributions for galaxies in the SDSS DR8. The Astrophysical Journal Supplement Series 201(2).
Sinha, K. and M. Belkin (2009). Semi-supervised learning using sparse eigenfunction bases. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, pp. 1687–1695.
Steinwart, I. and A. Christmann (2008). Support Vector Machines. Springer.
Steinwart, I., D. R. Hush, and C. Scovel (2009). Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pp. 79–93.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics 10, 1040–1053.
von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416.
Ye, G.-B. and D.-X. Zhou (2008). Learning and approximation by Gaussians on Riemannian manifolds. Advances in Computational Mathematics 29(3), 291–310.
Zhou, X. and N. Srebro (2011). Error analysis of Laplacian eigenmaps for semi-supervised learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Volume 15, pp. 892–900.