+ All Categories
Home > Documents > Bhattacharya Nonparametric

Bhattacharya Nonparametric

Date post: 13-Apr-2018
Category:
Upload: mp113
View: 237 times
Download: 0 times
Share this document with a friend

of 30

Transcript
  • 7/27/2019 Bhattacharya Nonparametric

    1/30

    Nonparametric Bayesian Density Estimation on Manifolds

    with Applications to Planar Shapes

    Abhishek Bhattacharya and David DunsonDepartment of Statistical Science, Duke University

    Abstract. Statistical analysis on landmark-based shape spaces has diverseapplications in morphometrics, medical diagnostics, machine vision, robotics

    and other areas. These shape spaces are non-Euclidean quotient manifolds,often the quotient of the unit sphere under a group of transformations. To

    conduct nonparametric inferences, one may define notions of center and spreadof a probability distribution on an arbitrary manifold and work with their es-

    timates. There has been a significant amount of work done in this direction.However, it is useful to consider full likelihood-based methods, which allownonparametric estimation of the probability density. This article proposes a

    class of mixture models constructed using suitable kernels on a general com-pact non-Euclidean manifold and then on the planar shape space in particular.

    Following a Bayesian approach with a nonparametric prior on the mixing dis-tribution, conditions are obtained under which the Kullback-Leibler propertyholds, implying large support and weak posterior consistency. Gibbs sampling

    methods are developed for posterior computation, and the methods are ap-plied to problems in density estimation on shape space and classification with

    shape-based predictors.

    1. Introduction

    In recent years, there has been considerable interest in the statistics litera-ture in the analysis of data having support on a non-Euclidean manifold M. Ourfocus is on nonparametric approaches, which avoid modeling assumptions aboutthe distribution of the data over M. Although we are particularly motivated bylandmark-based analyses of planar shapes, we develop nonparametric Bayes theoryand methods also for general manifolds.

    There is a rich literature on frequentist methods of inference on manifolds,

    which avoid a complete likelihood specification in conducting nonparametric esti-mation and testing based on manifold data. Refer, for example to Bhattacharyaand Bhattacharya [1] and the references cited therein. Such methods are basedon estimates of center and spread, which are appropriate for manifolds. However,other aspects of the distribution other than center and spread may be important. Inaddition, Bayesian likelihood-based methods have the advantage of providing a full

    Key words and phrases. Non-Euclidean manifold; Planar shape space; Nonparametric Bayes;

    Dirichlet process mixture; KL property; Posterior consistency; Discriminant analysis.

    1

  • 7/27/2019 Bhattacharya Nonparametric

    2/30

    2 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    probabilistic characterization of uncertainty, which is valid even in small samples.

    There is a very rich literature on nonparametric Bayes density estimation inEuclidean spaces, with the most commonly used method based on kernel mixturemodels of the form

    f(y; P) =

    K(y; )P(d),(1.1)

    where K is a kernel and P is a mixture distribution. For example, for univariatedensity estimation withy R, the kernel is commonly chosen as

    K(y; ) = (22)1/2 exp{ 122

    (y )2},with = (, ), leading to a mixture of Gaussians. In allowing the mixture distri-bution to be unknown through a prior distribution with large support, one obtainsa highly-flexible specification. A common choice of prior for P is the Dirichlet

    process (DP) (see Ferguson [6

    ], [7

    ]), resulting in a DP mixture (DPM) of Gaus-sians (Escobar and West[5]). Lo[14] showed that DPM location-scale mixtures ofGaussians have dense support on the space of densities with respect to Lebesguemeasure, while Ghosal et al. [8]proved posterior consistency.

    Our focus is on developing Bayesian methods for nonparametric density esti-mation on non-Euclidean manifolds Musing a specification similar to (1.1). Themanifold of special interest is the planar shape space k2 - the space of similarityshapes of configurations ofk landmarks in 2D.

    Frequentist methods for nonparametric density estimation on non-Euclideanmanifolds have been developed in Pelletier [20]. In that paper, an appropriatekernel is presented on a compact manifold which generalizes the commonly usedlocation-scale kernel on Euclidean spaces. It is used to build a kernel density

    estimate (KDE) which uses the sample points as the locations and a fixed knownband-width. It is proved that the KDE is L2 consistent for a sufficiently smallband-width.

    We use that kernel to build mixture density models on general manifolds. Forlandmark-based shape analyses, we focus on mixtures of Complex Watson (CW)distributions. The CW distribution was proposed in Watson [25],[26] as a conve-nient parametric distribution for data on spheres, and later in Dryden and Mar-dia[4]for planar shape data. Kume and Walker [12]recently proposed an MCMCmethod for posterior computation in CW parametric models.

    To do Bayesian inference, as in Euclidean spaces, the kernel must be carefullychosen, so that the induced prior will have large support, meaning that the priorassigns positive probability to arbitrarily small neighborhoods around any density

    f0. Such a support condition is important in allowing the posterior to concentratearound the true density increasingly as the sample size grows. From the theoremof Schwartz [21], prior positivity of Kullback-Leibler (KL) neighborhoods aroundthe true densityf0 implies that the posterior probability of any weak neighborhoodoff0 converges to one as n . Showing that a proposed prior has KL supportis important in providing a proof of concept that the prior is sufficiently flexible.Unfortunately, showing KL support tends to be quite difficult for new priors even inEuclidean spaces, though Wu and Ghosal [29] provide useful sufficient conditions.

  • 7/27/2019 Bhattacharya Nonparametric

    3/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 3

    In this paper, we extend those results to general manifolds and in particular to theplanar shape space using the CW kernel.

    In addition to large support, nonparametric Bayes procedures must be com-putationally tractable and lead to interpretable results in order to be useful inpractice. The enormous success of DPM models is largely due to the availabilityof efficient and easy to implement computational algorithms, such as the Polyaurn Gibbs sampler (Bush and MacEachern[3]), the block Gibbs sampler (Ishwaranand James [9]) and the exact block Gibbs sampler (Papaspilopoulos [18]). DPpriors are characterized by a precision parameter and a base probability measureP0, with computational efficiency improved whenP0 is conjugate to the likelihood.We develop efficient methods for simulating from the posterior distributions of ourmixture models using DP priors.

    Lennox et al. [13] proposed a DPM of bivariate von Mises distributions forprotein conformation angles, modifying the finite mixture model of Mardia, Tay-

    lor and Subramaniam [16]. Posterior computation relies on the auxiliary Gibbssampler of Neal [17], with efficiency improved through conditionally-conjugate up-dating. Their approach is specific to angular data and they do not present resultson support of the prior. It is potentially the case that there are certain angulardistributions that cannot be accurately characterized as mixtures of von Mises dis-tributions.

    This article is organized as follows. In Section 2, we develop kernel mixturedensity models on general compact Riemannian manifolds. Through Theorems 2.2and2.4, we provide mild sufficient conditions on the true density and the prior onthe mixing distribution so that the induced priors satisfy the KL property. Theseconditions are trivially satisfied by many standard priors such as DP. We present analgorithm based on the exact block Gibbs sampler to simulate from the posterior

    distribution of the density. These results are then applied to the unit sphere inSection3.

    Section4 provides a brief overview of the geometry of k2 . In Section 5, wepresent some important parametric distributions on this space, discuss their prop-erties and show how to sample from them. These distributions come into muchuse in the later sections to build mixture density models on k2 with large supportand for posterior computations. In Section6, we carry out nonparametric densityestimation on k2 using mixtures of CW kernels. We prove that the KL propertyholds for the induced priors under mild assumptions on the mixing priors and thetrue density in Theorems 6.2 and 6.3. We adapt the methods from Section 2.3for posterior computations using a DPM of CW kernels. We present a choice forbase measure P0 and prior band-width distribution using which we get posteriorconditional conjugacy and the computational efficiency is highly enhanced.

    We present some applications of the methods developed in Sections 7 and8.In Section8,we numerically compare the performance of our density estimate withother estimates such as KDE and parametric model based estimate. To do so, wesimulate data from a known distribution, estimate the distribution by each methodand estimate the divergence of the density estimate from the true density. It turnsout that the Bayes estimate performs much better than the other two. Finallyin Section8,we perform classification of real-world data via nonparametric Bayes

  • 7/27/2019 Bhattacharya Nonparametric

    4/30

    4 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    discriminant analysis. In this example, there are samples of male and female gorillaskull images. We estimate the shape density for each group and then estimatethe conditional probability of a skull being female given its shape, using which weclassify it as male or female.

    The proofs of our major results are presented at the end in an Appendix section.

    2. Nonparametric density estimation on general manifold

    Let (M, g) be a compact Riemannian manifold of dimension d, g being theRiemannian metric tensor. Let dg be the geodesic distance under g . Then (M, dg)is a complete metric space. Letr denote the injectivity radius ofM. SinceM iscompact, 0< r < . For p M, let TpMbe the tangent space ofM atp whichis isomorphic to Rd. Then the exponential map atp, expp : TpM M providesthe normal coordinates atp. If we denote by Bp(0, r) a ball of radius r centeredat the origin in TpM, then expp is a diffiomorphism from Bp(0, r) into M. Thisball is contained in a normal neighborhood ofp. For an Euclidean space, r =

    and the entire space can be covered by one coordinate patch.For p, mM, let Gp(m) be the volume density function on M. Ifm belongs

    to a normal neighborhood ofp, then Gp(m) is the density of the pull back of thevolume measure on M toTpMwith respect to the Lebesgue measure on TpM viathe inverse exponential map exp1p . If x denotes the normal coordinates for m,then

    Gp(m) = det(A(x))1/2 whereA(x)ij =g(

    xi,

    xj)(x), 1 i, j d,

    and Gp(p) = 1. In a normal neighborhood, G is strictly positive and Gp(m) =Gm(p) (see Willmore [28]). This volume density function can be extended as anon-negative continuous function to the whole ofMusing Jacobi fields (see[28]).

    Note that on an Euclidean space G is identically equal to 1.2.1. Mixture density models on M. Consider the kernel

    (2.1) K(m; , ) = X( dg(m, )

    )dG1 (m)

    with variablem Mand parameters (, ) MR+. Here X : [0, ) [0, ) isa continuous function satisfying

    Rd

    X(x)dx= 1. We also assume that there existsa constant A >0 such that Ar andXis zero outside [0, 1A). Then K(.; , )is a well defined function because G(.) is strictly positive on the geodesic ball

    B(, r) = {m M : dg(, m)< r}and due the support restriction onX, K is defined to be zero outside this ball.Proposition2.1proves that (2.1) defines a valid probability density on M. From

    now on, for convenience, we will write K for K(.; , ) wherever the parameters(, ) remain fixed.

    Proposition2.1. For any fixedM and (0, Ar], K defines a proba-bility density onMwith respect to the volume measure.

    Proof. We need to show that

    (2.2)

    M

    K(m; , )V(dm) = 1 M, 0< Ar,

  • 7/27/2019 Bhattacharya Nonparametric

    5/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 5

    V(dm) being the volume form ofM. SinceKis zero outsideB (, r), the integralin (2.2) can be written as

    B(,r)K(m; , )V(dm). SinceB(, r) lies in a normal

    neighborhood of, using normal coordinates intoTM, the integral can be writtenasx

  • 7/27/2019 Bhattacharya Nonparametric

    6/30

    6 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    Definition2.1. The KL neighborhood of a density f0 of size >0 is definedas

    KL(f0, ) = {f : M

    f0(m)logf0(m)

    f(m)

    V(dm)< }.A prior on the space of probability densities on M (w.r.t. the volume measure)is said to satisfy the KL condition at f0 if for any >0,

    {KL(f0, )} >0.Corollary2.3provides conditions on f0 and the prior 1 1 for the param-

    eters (P, ) corresponding to the location mixture density f in (2.5) under whichthe induced prior satisfies the KL condition at f0. Theorem 2.2provides similarconditions on the prior 2 for the mixing measure Q in the location-scale mixturedensity g in (2.6). In fact as Theorem 2.2 shows, for the location mixture den-sity model we can even prove that arbitrarily small L neighborhoods around f0get positive probability under the prior induced by 1

    1. This implies the KL

    condition at f0 as shown in corollary2.3 and also positive prior probability for L1

    neighborhoods around f0. The proofs of Theorems 2.2 and 2.4 are given in theAppendix section9.1.

    Theorem 2.2. Letf0 be a continuous density onM andF0 be the correspond-ing probability distribution. Let f(m; P, ) be a density as in (2.5). Assume thatthe prior1 1 for(P, ) contains(F0, 0) in its support. Also assume that thereexists a positive constantr1 < r such that1{(0, Ar1]} = 1. Then for any >0(2.7) (1 1){(P, ) : sup

    mM|f0(m) f(m; P, )| < } >0.

    Corollary2.3. Letf0 be a strictly positive continuous density onM. Underthe assumptions of Theorem 2.2, the prior induced by 1 1 satisfies the KLcondition atf0.

    Proof. From now on for simplicity we shall use f(m) forf(m; P, ) wheneverit is understood. Since M is compact, f0(m) > 0 for all m M implies thatinfmMf0(m) = c0 > 0. For >0 define

    W = {(P, ) : supmM

    |f0(m) f(m; P, )| < }.

    Then if (P, ) W,inf

    mMf(m) inf

    mMf0(m) c0

    2

    if we choose c02. Then for any given >0,M

    f0(m)log

    f0(m)f(m)

    V(dm) supmM

    f0(m)f(m) 1 2c0 < if we choose < c02 . Hence for sufficiently small, f(.; P, ) KL(f0, ) whenever(P, ) W. From Theorem2.2it follows that (1 1)(W) > 0 for any > 0and therefore

    (1 1){(P, ) : f(.; P, ) KL(f0, )} >0.

  • 7/27/2019 Bhattacharya Nonparametric

    7/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 7

    Theorem 2.4. Letf0 be a strictly positive continuous density onMand letF0denote its corresponding distribution. Letg(m; Q) be a density as in (2.6). Let2be a prior onQ such that

    2{M(M

    (0, Ar

    1])}

    = 1 andF0

    0

    is in the supportof2. Then the prior on the space of densities onM induced by2 satisfies theKL condition atf0.

    Remark2.1. From Proposition2.1it follows thatfandg are valid probabilitydensities if 0< Ar. However, to show the KL condition for the mixture priorsin Corollary 2.3 and Theorem 2.4, we have added the stronger restriction that0 < Ar1. This restriction on the prior to smaller bandwidth is intuitivelyreasonable because smaller bandwidth is expected to give a finer approximation tothe unknown density.

    The conditions on the priors in Theorems 2.2and2.4correspond to the sizeof the support of the priors, which are trivially satisfied by many standard non-parametric priors. For example, for model (2.5) we can choose 1 to be a Dirichlet

    process prior DP(0P0) with supp(P0) = M and 1 to have a density on (0, Ar1]that is strictly positive in some neighborhood of zero. For (2.6), we can insteadchoose the prior 2 for the mixing measure Q to correspond to a Dirichlet pro-cess with base P0 1. These choices are convenient computationally, as we willillustrate in Section2.3.

    2.3. Posterior computation. For simplicity in describing an approach forposterior computation, we focus on the location mixture specified in (2.5) witha Dirichlet process prior for the mixing measure P. In Dirichlet process mixturemodels, there are two common strategies for posterior computation, with the firstrelying on a marginal approach that integrates out the mixing measure (MacEach-ern [15], West et al. [27]) and the second relying on a conditional approach (Ish-waran and James[9]). Conditional algorithms typically rely on the stick-breakingrepresentation of Sethuraman [22], which lets P =

    j=1

    wjsj

    , with sj

    P0,

    wj =Vj

    h

  • 7/27/2019 Bhattacharya Nonparametric

    8/30

    8 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    (1) UpdateSi, fori = 1, . . . , n, by sampling from the multinomial conditionalposterior distribution with Pr(Si = j) K(Xi; sj , ) for j Ai, where

    Ai ={j : 1 j l, wj > ui} and l is the smallest index satisfying1 u(1) S(n).

    (2) Update the atomssj ,j = 1, . . . , S (n), by samplingsj from the conditional

    posterior, which is proportional to P0(dsj)

    i:Si=jK(Xi; sj , ). This is

    equivalent to sampling from the prior for components that are unoccupied.(3) Update the bandwidth parameter by sampling from the conditional

    posterior, which is proportional to 1(d)n

    i=1 K(Xi; sSi

    , ).(4) Update the stick-breaking random variablesVj , for j = 1, . . . , S (n), from

    their conditional posterior distributions given the cluster allocation butmarginalizing out the slice sampling latent variables{ui}ni=1. In particu-lar,

    Vj

    Be(1 +i

    1(Si= j), 0+i

    1(Si> j)).

    (5) Update the slice sampling latent variables from their conditional posteriorby lettingui Unif(0, wSi), for i = 1, . . . , n.

    These steps are repeated a large number of iterations, with a burn-in discarded toallow convergence. In our experience, the algorithm is quite efficient, with rapidconvergence and no evidence of slow mixing in cases we have considered. Due tolabel switching issues (Stephens [23]), we recommend assessing convergence andmixing by examining trace plots and applying standard diagnostics for the densityf(m; P, ) evaluated at a dense grid ofm values. A draw from the posterior forfor the predictive density can be calculated using

    (2.8) f(m; P, ) =

    S(n)

    j=1

    wjK(m; sj , ) + 1

    S(n)

    j=1

    wj K(m; s, )dP0(s),with and wj ,

    sj , j = 1, . . . , S (n) an MCMC draw from the joint posterior of

    the bandwidth and the weights and atoms for each of the components up to themaximum occupied. A Bayes estimate of f can then be obtained by averagingthese draws across many samples. WhenP0 is chosen to correspond to the uniformdistribution over the manifold, the integral

    K(m; s, )dP0(

    s) = 1/Vol(M). Inthis case, computing the predictive density in (2.8) becomes relatively simple. How-ever, in many cases, the uniform distribution may be overly diffuse, having a lowprobability of generating clusters close to the data. This can lead to a very largepenalty on adding new clusters as data are added, and hence underestimation ofthe number of clusters. We recommend instead choosing a non-conjugateP0, whichassigns high probability to cluster means located close to the data values, with such

    a P0 chosen based on prior knowledge, past data or empirical Bayes. We can ac-commodate non-conjugate cases by using Metropolis-Hastings sampling in steps 2and 3 and analytically approximating the integral in (2.8). One way to do so isto replace the integral by K(m; , ), being a draw from P0. Alternatively, ittends to be the case unless the data set is small that 1 jS(n) wj 0, so thatwe can accurately approximate the predictive density discarding the final term in(2.8).

    In the next section, we explicitly compute the kernel Kon the unit sphere.

  • 7/27/2019 Bhattacharya Nonparametric

    9/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 9

    3. Application to the unit sphereSd

    Consider the unit sphere in Rd+1, namely,

    Sd = {m Rd+1 : m = 1}.Statistical analysis on the sphere finds lots of application in directional data analy-sis. Also since most of the shape spaces are quotients of the sphere, it is importantto understand its geometry and how to do inference on it.

    The sphereSd is a compact Riemannian manifold of dimension dand injectivityradius of. For two pointsm1, m2 Sd, the geodesic distance between them isgiven by

    dg(m1, m2) = arccos(m1m2)

    which lies between 0 and . The tangent space at m Sd isTmS

    d = {v Rd+1 : vm= 0}.It is endowed with the metric tensor from Rd+1, i.e. g(v1, v2) v1, v2 = v1v2.The exponential map takes the form

    expm: TmSd Sd, expm(v) = cos(v)m +

    sin(v)v v.

    It is a diffieomorphism from B(0, ) ontoSd \{m}. Proposition3.1computes thevolume-density function on the sphere.

    Proposition3.1. Forp, m Sd, d >1,

    Gm(p) =

    sin(dg(m, p))

    dg(m, p)

    d1.

    OnS1, Gm(.)

    1.

    Proof. Letp Sd \ {m}. For a choice of orthonormal basis {v1, . . . , vd} forTmSd, define

    : B(0, ) {x Rd : x < } Sd \ {m},(x) = expm(x

    ivi) = cos(x)m +sin(x)x xivi.

    Then x = 1(p) gives the normal coordinates for p. Let D(x) denote the de-rivative of at x, D(x) : Rd TpSd. Then Gm(p) ={det(g(x))}1/2 whereg(x) = ( (D(x)(ei), D(x)(ej)))1i,jd and{e1, . . . , ed} denotes the canonicalbasis for Rd. Denote by V the matrix [v1, . . . , vd]. Then it is easy to show thatD(x) is the (d + 1) d matrix,

    D(x) = sin(

    x

    )

    x mx +sin(

    x

    )

    x V +cos(x)x2 sin(x)x3 V xx

    so that

    ((g(x)))i,j=sin2 x

    x2 ij+

    1 sin2 x

    x2

    xixj

    x2and hence

    g(x) =sin2 x

    x2 Id+

    1 sin2 x

    x2

    xx

    x2

  • 7/27/2019 Bhattacharya Nonparametric

    10/30

    10 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    which has eigen-values 1 with multiplicity 1 and sin2 xx2 with multiplicity d 1.

    Since det(g(x)) is the product of its eigen-values, we get

    Gm(p) =

    sin(x)x

    d1

    ifd >1

    1 ifd = 1.

    Sincex =dg(m, p), we get the desired expression for Gm(p). To get a kernel on the sphere as in (2.1), we choose the functionX such that

    X(x) defines a density on Rd with compact support. Now we can build mixturedensity models on Sd as in Section2.1. To sample from the posterior distributionof the density as in Section2.3, we need to write the conditional posteriors usinga suitable coordinate system on Sd. A natural choice is using normal coordinatesinto the tangent space of some fixed point such as the estimated center of the dis-tribution. For different notions of centers on the sphere and their properties, see

    [2]and [1].

    In the next section, we describe our main manifold of interest, namely theplanar shape space of k-ads, and carry out density estimation on it in the subsequentsections.

    4. The planar shape spacek2

    Consider a set ofk points, k > 2, on the 2D plane, not all points being thesame. We refer to such a set as a k-ad or a set ofk landmarks. The similarityshape of this k-ad is what remains after we remove the effects of the Euclideanrigid body motions of translation and rotation and scaling. For convenience wedenote a k-ad by a complex k-vector z = (z1, z2, . . . , zk)

    , i.e., we will representk-ads on a complex plane. To remove the effect of translation from z, one subtracts

    z = 1

    k

    kj=1

    zj

    fromz to bring its centroid to the origin. This centered k-adzc lies on the complex(k 1)-dimensional subspaceHk1 ofCk consisting of all vectors orthogonal to thevector1kof all ones. Using an orthonormal basis for H

    k1, we compute coordinateszH Ck1 forzc, that is

    zc=k1j=1

    zjHHj =H zH

    with H= [H1, . . . , H k1], columns of which form an orthonormal basis for Hk1.

    The effect of scaling is removed by dividing zHby its total norm zH =k1j=1 |zjH|2(which iszc). The normalized k-adw lies on the complex unit sphere

    CSk2 = {w Ck1 : w = 1}which can be identified with the real sphere S2k3. Since w contains the shapeinformation ofz along with rotation, it is called the preshape ofz . The space of allpreshapes forms the preshape sphere Sk2 which is CS

    k2 or S2k3. The similarityshape ofz is then the orbit ofw under all rotations in 2D. Since a rotation by an

  • 7/27/2019 Bhattacharya Nonparametric

    11/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 11

    angle of a landmark (x, y) can be achieved by multiplying its complex versionx + iy byei, the shape ofz (orw) is the set (or orbit)

    [w] = {ei

    w: [0, 2)}.The space of all such orbits constitutes the planar shape space k2 which is thequotient of the preshape sphere under all one dimensional rotations, that is

    k2 =Sk2 /[0, 2) = {[w] : w Sk2 }.

    Since any shape or orbit is the set of all intersection points of a unique line passingthrough the origin in Ck1 with CSk2, the planar shape space can be identifiedwith the complex projective space CPk2 which is the space of all complex linespassing through the origin in Ck1. With this identification, k2 is a compactRiemannian manifold of dimension 2k 4. It has an injective radius r of 2 . Thegeodesic distance between two shapes [u], [v] (u, v Sk2 ) is given by

    dg([u], [v]) = arccos(|uv|)where denotes the complex conjugate transpose. For m = [u]k2 , the tangentspaceTmk2 can be identified with the complex (k 2)-dimensional subspace

    Vu= {v Ck1 :uv= 0}.For a choice of orthonormal basis {v1, . . . , vk2, iv1, . . . , i vk2} forVu (over R), thenormal coordinates for a shape m1= [u1] into Tmk2 is given by

    z = (x1, . . . , xk2, y1, . . . , yk2),

    xj+ iyj = rsin(r) e

    ivj u1, j = 1, . . . , k 2,r= dg(m, m1) = z, ei = u

    1u|u1u|

    .

    k2 can be embedded into the space S(k 1,C) of all (k 1) (k 1) complexHermitian matrices via the Veronese-Whitney embedding which is given by

    J: k2 S(k 1,C), J([u]) = uu.HereS(k 1,C) is viewed as a linear subspace ofC(k1)2 of real dimension (k1)2.The extrinsic distance between two shapes [u], [v] is the one induced from thisembedding, namely,

    dE([u], [v]) = J([u]) J([v]) =

    2(1 |uv|2).

    4.1. Center and spread. Let Q be a probability distribution on k2 . Thecenter ofQ can be meaured by its extrinsic or intrinsic means while its extrinsic orintrinsic variations define notions of spread ofQ.

    The extrinsic mean ofQ is defined as the minimizer of the loss function

    F(p) =

    k2

    d2E(m, p)Q(dm), p k2(4.1)

    provided Fhas a unique minimizer. The minimum value ofFis called the extrinsicvariation ofQ. Given a sampleX1, . . . , X nfromQ, the extrinsic mean and variationof the empirical distribution 1n

    ni=1 Xi are called the sample analogues. LetQ

    J =QJ1 denote the push forward ofQ in toS(k 1,C) using the Veronese-Whitney

  • 7/27/2019 Bhattacharya Nonparametric

    12/30

    12 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    embedding J. ThenQJ has a compact support in S(k 1,C) and hence a welldefined Euclidean mean

    J

    =S(k1,C) xQ

    J

    (dx).

    SinceJ is the average of positive semi definite (p.s.d.) trace 1 matrices, it is alsop.s.d. with trace equal to 1. Proposition 4.1 identifies the extrinsic parametersofQ as functions ofJ. For a proof, see Bhattacharya and Bhattacharya [1] andBhattacharya and Patrangenaru[2].

    Proposition4.1. Letk1 denote the largest eigen-value ofJ and letUk1

    be a corresponding unit norm eigen vector. (a)Q has a unique extrinsic mean iffk1 is a eigen-value with multiplicity 1 and then the mean is given by[Uk1]. (b)The extrinsic variation ofQ equals2(1 k1).

    In defining the loss functionF in (4.1), if we replacedEby the geodesic distancedg, its minimizer defines the intrinsic mean ofQ, provided it has a unique minimizer.

    For more details on the properties of the extrinsic and intrinsic parameters and theirestimates, see [1] and [2] and the references cited therein.

    5. Parametric models on the planar shape space

    In this section we present some well known probability distributions on k2 andstudy their properties. These models will come into much use in the later sectionsfor nonparametric density estimation.

    5.1. Uniform distribution. LetV(dm) andV1(dz) denote the volume-formson the shape space k2 and the preshape sphere S

    k2 respectively. The uniform

    measure on k2 is then given by the constant density V1 where V =

    k2V(dm)

    denotes the volume of k2 . Kent[10] proposed a useful coordinate chart. Forz =

    (z1, . . . , zk1)

    onSk2 , writezj = rje

    ij

    ,j = 1, 2, . . . , k1 withr = (r1, . . . , rk2)

    on the unit simplex

    Sk2 = {r [0, 1]k2 :k2j=1

    rj 1},

    rk1 = 1 k2

    j=1rj and j (, ), j = 1, 2, . . . , k 1. Then (r1, . . . , rk2,1, . . . , k1) form the coordinates ofz, we will call that Kents preshape coordi-nates. Since the shape ofz can be obtained by rotating it around a fixed axis, wemay setk1= 0 and use the coordinates

    (r1, . . . , rk2, 1, . . . , k2)

    for [z]. These coordinates are derived in Dryden and Mardia[4], we will call them

    shape coordinates. The advantage of using these coordinate systems on Sk2 and

    k2

    is that we get simple expressions for the volume forms. It can be shown that (seeKent [10])

    V1(dz) = 22kdr1 . . . d rk2d1 . . . d k1,

    V(d[z]) = 22kdr1 . . . d rk2d1 . . . d k2.

    In other words, in terms of these shape coordinates, the uniform distribution on k2remains uniform onSk2(, )k2. This helps us simulate from this distribution

  • 7/27/2019 Bhattacharya Nonparametric

    13/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 13

    and also from the other models stated below. We shall also need this expressionfor the volume form in proving the KL condition in Section 6.

    5.2. Complex Bingham distribution. The Complex Bingham distributionon k2 has the following density with respect to the volume form:

    f(m; A) = c1(A) exp(zAz).

    Herez Sk2 is some preshape ofm k2 and the parameter A S(k 1,C), c(A)being the normalizing constant. It was proposed in Kent[10]. We will denote thisdistribution by CB(A) or just CB. Note that C B(A) = CB(A + I) for any R,so that w.l.o.g. we may assumeA to be p.s.d. with smallest eigen-value equal to 0.

    5.3. Complex Watson distribution. A special case when A has complexrank equal to 1 is the Complex Watson distribution which has the density

    f(m; , ) = c1()exp(|z|2/)

    with parameters k

    2 and >0, c() being the normalizing constant. z and are some preshapes ofm and respectively. We shall represent this distribution asCW(, ) or just C W. Note that

    CW(, ) = C B(J()/),

    Jbeing the Veronese-Whitney embedding mentioned in Section 4.

    5.4. Properties of CB and CW distributions. In case ofCB (A), writeA= UU with

    U= [U1, . . . , U k1] SU(k 1), = diag(1, . . . , k1), 0 = 1 . . . k1,

    whereSU(k1) is the space of all (k1)(k1) special unitary matrices (U U =I,det(U) = 1). This representation is called a singular value decomposition (s.v.d.)

    for A. Make a change of variable [z] [z1], z1 = Uz. This transformation doesnot change the volume form on the shape space. Then use Kents shape coordinates(r, ) for [z1]: r = (r1, . . . , rk2) Sk2, = (1, . . . , k2) (, )k2. Thenthe CB distribution can be written as

    (5.1) f([z]; A)V(d[z]) = c1(A)22k exp(k1j=1

    jrj)dr1 . . . d rk2d1 . . . d k2

    withrk1 = 1 k2

    j=1rj . Hence under the CB distribution, r has the density pro-

    portional to exp(k1

    j=1jrj) onSk2, 1, . . . , k2 are iid Unif(, ) andr andare independent.

    For the CW(, ) distribution, since 1 = . . . = k2 = 0, k1 = 1,

    therefore the distribution ofr can be written asf(r) exp(1rk1) which impliesthatrk1 has the marginal distribution

    g(rk1) = c1k1()e

    rk1/(1 rk1)k3, rk1 (0, 1) whereck1() =

    k2e1/(k 2; 1) and

    (m; a) =

    a0

    ettm1dt= (m 1)!ea

    ea m1r=0

    ar

    r!

  • 7/27/2019 Bhattacharya Nonparametric

    14/30

    14 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    denotes the partial gamma function. Conditioned onrk1, r = (r1, . . . , rk2) hasa uniform distribution on the set

    {rj 0, j = 1, . . . , k 2,k2

    1

    rj = 1 rk1}.

    Normalizing constants. Expression (5.1) suggests that for theC B(A) distribu-tion, c(A) depends on A only through its eigen values and hence is equal to c().For the CW distribution, c() can be derived to be

    c() = 22k(2)k2Sk2

    erk1/dr2 . . . d rk1

    = 22k(2)k2(k 3)!1 1

    0

    erk1/(1 rk1)k3drk1= ()(k2)e1/(k 3)!1(k 2; 1)

    = ()(k2)

    e1/ k3r=0

    r

    r!

    .

    In [4], the CB & CW distributions are viewed as distributions on the preshapesphere and hence the normalizing constant is derived to be 2c().

    Extrinsic mean and variation. Let X1 CB (A). Then the extrinsic meanfor the CB distribution can be expressed as the shape of a unit eigen-vector corre-sponding to the largest eigen-value ofJ = E[J(X1)]. Letz be one of the preshapesofX1, z1 = Uz and (r, ) be the shape coordinates for [z1]. Then

    J =E[zz] = U E[z1z1 ]U

    .

    Take k1 = 0. Then since

    (z1z1 )ij = rirjei(ij), 1 i, j k 1,therefore

    E(z1z1 )ij =

    0 ifi =jE(ri) ifi = j.

    Hence

    J =Udiag(E(r1), . . . , E (rk1))U

    and the extrinsic mean E = [Uj0] where E(rj0) = max1jk1 E(rj) providedthere is a unique such j0. The extrinsic variation is 2(1 E(rj0)).

    For the C W(, ) distribution,

    E(r1) = . . .= E(rk2) = 1E(rk1)

    k2 and

    E(rk1) = c1k1()1

    0ex/x(1 x)k3dx

    = 1 (k1;1)(k2;1) = 1 (k 2)1e1/

    k2

    0 r!1r

    1e1/

    k3

    0 r!1r

    .

    It can be shown thatE(rk1)> 1k1 and hence E(rk1)> E(rj), j = 1, . . . , k 2.

    Therefore for this distribution, the extrinsic mean is

    E= [Uk1] =

  • 7/27/2019 Bhattacharya Nonparametric

    15/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 15

    and the extrinsic variation equals

    VE= 2(1 E(rk1)) = 2(k 2)1

    e1/k20 r!1r

    1 e1/k30 r!1r .For small, E(rk1) 1 (k 2) and then VE 2(k 2).

    5.5. Simulation from CW distribution. To draw a sample fromCW(, ),we may draw 1, . . . , k2 iid from Unif(, ), set k1 = 0, and draw r =(r1, . . . , rk2), rk1 = 1

    k2j=1rj from the distribution f(r) exp(1rk1)

    on Sk2. Let z1 = (z11 , . . . , z

    k11 ), z

    j1 =

    rje

    ij , get U SU(k 1) such thatJ() = UU, = diag(0, . . . , 0, 1) and set z = U z1. Then [z] is a random samplefrom the Complex Watson distribution.

    We saw in Section 5.4 that under the distribution f, rk1 has the marginaldistribution g(rk1)

    erk1/(1

    rk1)

    k3 on (0, 1). Make the transformation

    sk1 = 1(1rk1). Thensk1 follows the distribution h(sk1) esk1sk3k1 on(0, 1) which is Gamma(k 2, 1) density restricted to (0, 1). Drawsk1 by theinverse-cdf method and set rk1 = 1 sk1. Then draw (s1, . . . , sk2) from theDirichlet distribution with all parameters set equal to 1 and set rj = (1 rk1)sj ,

    j = 1, . . . , k 2. This gives us a draw from f.5.6. Simulation from CB distribution. For the CB(A) distribution, we

    saw in Section 5.4 that r f(r) exp(k1j=1jrj) on Sk2. Unless we havesome more information on the eigen-values j as in case of CW(, ), it is noteasy to simulate exactly from f. We may instead use a full conditional Gibbssampling method to draw r. Draw rj for j = 1, . . . , k 2 from the density pro-portional to exp((j k1)rj) on (0, 1 r(j)) using the inverse-cdf method wherer

    (j)

    =k2

    i=1i=jri. Then setrk1 = 1 k2j=1rj . Draw 1, . . . , k1 and compute[z] as in Section5.5.

    Under high concentrations, that is when k1 >> k2, a more effective ap-proach would be to use an independent exponential approximation. That is drawrj ,

    j = 1, . . . , k 2 independently from the density proportional to exp((j k1)rj)on (0, 1). Accept the draw if

    k2i=1 ri 1 and then set rk1 = 1

    k2j=1rj .

    6. Density estimation on k2

    Since the planar shape space is a compact Riemannian manifold, we could usethe kernel defined in Section 2.1 to build mixture density models on this space.However it is not easy to get an exact expression for the kernel because of the

    volume density term involved. Also to simulate from the posterior distribution ofthe density as in Section 2.3, if we write the conditional posteriors of the atomsand bandwidth using normal coordinates, then the expressions become messy. It isnot easy to sample from them due to lack of conjugacy. In this section, we presentan alternative kernel and construct mixture density models using that, for whichthe theoretical and numerical computations are greatly simplified, as we shall seein the subsequent sections.

  • 7/27/2019 Bhattacharya Nonparametric

    16/30

    16 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    Consider the Complex Watson kernel on the planar shape space as mentionedin Section5.3. That is,

    K(m; , ) =c1()exp|x

    y|2

    (6.1)

    wherem, k2 , x, y are some preshapes ofm, respectively and

    c() = ()(k2)[e1/ k3r=0

    r

    r! ], R+.

    Then for fixed, ;K(.; , ) defines a valid probability density on k2 with respectto the volume measure. As shown in Section5.4,it has an extrinsic mean Eequalto and extrinsic variation

    VE= 2(k 2) 1 e1/k2

    0 r!1r

    1

    e1/k30 r!1r

    which is approximately a constant multiple of when is small. Note that K canbe written as

    c1()exp

    1

    cos2 dg(m, )

    .

    Hence it is similar to the kernel in equation (2.1), except that now we need not putany constraint on the support ofX or for it to be a valid probability density.

    Using this kernel, we can define a location mixture or a location-scale mixturedensity model on k2 as in (2.5) and (2.6) respectively. We set priors on the mixingparameters which induce corresponding priors on the space of densities. We provethat the induced priors satisfy the KL condition for both models which implyposterior consistency for the Bayes estimates of the densities. The computations

    are greatly simplified by using the shape coordinates described in Section 5.1 dueto the fact that under this coordinate system, the uniform measure on k2 remainsuniform onSk2 (, )k2. This observation helps us remove the constraints onthe kernel parameters and prove KL property under lesser restrictions. It is provedin Theorems6.2and6.3. In proving them, we will use the following lemma. Theproof is given in the Appendix section9.2.

    Lemma 6.1. LetF0 be an absolutely continuous probability distribution onk2and letf0 be its density. For a probabilityP onk2 and >0, define

    f(m; P, ) =

    k2

    K(m; , )P(d)

    which is a valid probability density. Assume thatf0 is Holder Continuous on the

    metric space (k2 , dE), i.e. there exists constantsA,a > 0 such that for any twopointsp, q k2 ,

    |f0(p) f0(q)| AdE(p, q)a.Then

    supmk2

    |f(m; F0, ) f0(m)| 0

    as 0.

  • 7/27/2019 Bhattacharya Nonparametric

    17/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 17

    Theorem 6.2 states the KL property for the prior induced on the space ofdensities on k2 using a location mixture density model while Theorem 6.3states itin case of a location-scale mixture density model. The proofs follow from Lemma 6.1

    just as Theorems2.2 and 2.4 use Lemma9.1 in their proofs, once we note that Kis continuous in m,, on k2 k2R+. Hence the proofs are omitted.

    Theorem 6.2. Let f0 be a Holder continuous density on k2 and F0 be thecorresponding probability distribution. Define

    (6.2) f(m; P, ) =

    k2

    K(m; , )P(d)

    withKas in (6.1). Let1 be a prior onM(k2 ) which containsF0 in its support.Let1 be a prior onR

    + containing 0 in its support. Then for any >0,

    (1 1){(P, ) : supmk2

    |f0(m) f(m; P, )| < } >0.

    Further iff0(m)> 0 m k2 , then(1 1){(P, ) : f(.; P, ) KL(f0, )} >0.

    Theorem 6.3. Define

    f(m; Q) =

    k2R

    +

    K(m; , )Q(dd).

    Let2 be a prior onM(MR+) containingF0 0 in its support. Then iff0 isHolder continuous and strictly positive onk2 , then

    2{P :f(.; P) KL(f0, )} >0.

    6.1. Posterior computation. In this section, we describe an exact blockGibbs sampling algorithm for posterior computation in Dirichlet process locationmixture of Complex Watson kernels using the mixture model in (6.2). The al-gorithm follows the general steps outlined in Section 2.3, with our goal being tosimulate from the posterior distribution offgiven an iid sample X1, . . . , X n fromfand obtain a Bayes estimate for f. For the location-scale mixture, the compu-tations are very similar and are left to the reader. The prior 1 on P is taken tobe DP(w0P0) with w0 = 1 and P0 = CW(0, 0) for some 0 k2 , 0 >0 whilethe prior 1 for is chosen to be the Inverse Gamma distribution with some fixedhyper-parametersa,b >0, i.e.,

    1(d) (1)a+1 exp(b1), >0.These prior choices cause posterior conjugacy as we shall see soon. In the algorithm

    described in Section2.3for sampling from the posterior distribution of the density,at any given iteration, the distinct location atoms sjare drawn from the conditionalposterior

    f(sj)

    i:Si=j

    K(Xi; sj , )P0(d

    sj)

    exp{y

    mj

    Zj+ 1

    0A0

    y}

  • 7/27/2019 Bhattacharya Nonparametric

    18/30

    18 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    whereyis some preshape ofsj ,mjis the number of observations allocated to cluster

    j in the current iteration, Zj is the average of the embedded sample corresponding

    to clusterj , i.e.Zj =

    1

    mj

    i:Si=j

    J(Xi)

    andA0 = J(0). This implies that

    sj |{X1, . . . , X n, S1, . . . , S n, } CB

    mj

    Zj+ 1

    0A0

    .

    Hence the CB prior P0 ensures conditional posterior conjugacy. We sample fromthis distribution by one of the methods described in Section 5.6. We draw fromits full conditional posterior

    g() n

    i=1

    K(Xi; i, )1(d)

    (1)n(k2)+a+1 exp{n + b S(n)

    j=1

    mjyj

    Zjyj

    1}

    1 e1/

    k3r=0

    1

    r!r

    n

    whereyjdenotes some preshape forsj ,j = 1, . . . , S (n). Forsmall, this conditional

    density is approximately equal to that of

    IG

    n(k 2) + a, b +

    S(n)

    j=1mj(1 yj Zjyj)

    .

    Hence we get approximate conjugacy for the conditional distribution of oncewe choose a IG prior . Numerical studies show that this approximation is veryaccurate even for moderately small. Hence an independent Metropolis Hastingsstep for updating , with candidates generated from the IG approximation, shouldbe highly efficient.

    7. Application to simulated data

    We draw an iid sample of size 200: X1, . . . , X n, n = 200, on the planar shapespace k2 , k = 4, from the density

    f0 = 0.5CW(1, 0) + 0.5CW(2, 0) with

    0 = .001, 1 = [(1, 0, 0)], 2= [(r,

    1

    r2, 0)] where r = .9975.

    We try three different density estimates for f0, namely a nonparametric (np)Bayesian density estimate as obtained in Section 6.1, a frequentist parametric es-timate and a kernel density estimate (KDE). We compare their performance byestimating the distance between the true density and the density estimate. Weuse two types of distances, namely the L1 distance and the Kullback-Leibler (KL)divergence. In turns out that the np Bayes estimate performs the best in both cases.

  • 7/27/2019 Bhattacharya Nonparametric

    19/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 19

    TheL1 divergence betweenf0 and another density f1 is given by

    d1(f0, f1) = k2 |

    f0(m)

    f1(m)

    |V(dm)

    which can be estimated consistently by

    d1(f0, f1) = 1

    n

    ni=1

    1 f1(Xi)f0(Xi) .

    The KL divergence betweenf0 andf1 is defined to be

    d2(f0, f1) =

    k2

    f0(m)log

    f0(m)

    f1(m)

    V(dm),

    a consistent estimator of which is given by

    d2(f0, f1) = 1

    n

    n

    i=1

    logf0(Xi)

    f1(Xi) .To get the Bayes estimate, we estimatef0using expression (2.8) averaged over a

    large number of iterations of the exact block Gibbs sampler described in Section 6.1for the DP location mixture of CW kernels model. To complete a specification ofthe model, we let P0 = CW(E, 0.1), with Ebeing the sample extrinsic mean.By using the data to estimate the center of the base distribution, while choosing amoderate variance, we ensure that the prior introduces clusters close to the supportof the data. This default leads to better performance than using a uniform basemeasure which is the limit of CW distributions as . The prior 1 for isset to be IG(1, .1) and the DP precision parameter is fixed as 0 = 1, which is acommonly-used default in the literature, which favors a sparse representation withfew clusters.

    We ran the Gibbs sampler for 100,000 iterations, with the first 15,000 discardedas a burn-in. Posterior summaries of the distances, including posterior means andcredible intervals are summarized as follows:

    d1 = 0.3374, 95%CI= (0.2308, 0.4538), 99%CI= (0.2009, 0.4931)

    d2 = 0.0669, 95%CI= (0.0234, 0.1227), 99%CI= (0.0135, 0.1426)

    To get a single kernel based frequentist density estimate, we fit a CW(, )distribution to the data, estimating and by their MLEs mle and mle respec-tively. Let Zdenote the embedded sample mean, letk1 denote its largest eigen

    value and let Uk1 be a corresponding unit eigen vector. It is shown in [4] that

    mle = [Uk1] which is the sample extrinsic mean and under high concentrations

    (i.e. lk1 close to 1) mle is approximately equal to 1k1

    k2 which is VE2(k2) where

    VEdenotes the sample extrinsic variation. Denoting the density estimate by fmleCW(mle, mle), the estimated distances from the true density f0 turn out to be

    d1(f0, fmle) = 0.7182, d2(f0, fmle) = 0.4727.

    Finally we use a frequentist KDE

    f(m) = 1

    n

    nj=1

    K(m; Xj , h)

  • 7/27/2019 Bhattacharya Nonparametric

    20/30

    20 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    with K as in (6.1) and fixed band-width h > 0. We may take h to be equal to0 or mle or the Bayes mean for the posterior distribution of in the model

    f(.; P, ). It turns out that mle= 0.0017 and = 0.0014. The values for d1(f0,

    f)and d2(f0, f) for various values ofh are shown in table 1. Also included are the

    performance of the np Bayes and single kernel estimates for a side by side com-parision. It shows that the nonparametric Bayes density estimate performs much

    Table 1. Estimated divergence from f0 for 3 density estimates

    KDE np Bayes fmleh d1 d2 d1 d2 d1 d2

    0.001 0.8404 0.2649

    0.3374 0.0669 0.7182 0.47270.0014 0.8473 0.48330.0017 0.8691 0.62380.0009 0.8548 0.20007

    better than the parametric estimate and the KDE.

    8. Application to morphometrics: classification of gorilla skulls

    In this real life example, eight landmarks are chosen on the midline plane of2D images of some gorilla skulls. There are 29 male and 30 female gorillas in thesample. The data can be found in Dryden and Mardia [4]. The goal is to studythe shapes of the skulls and use that to build a classifier to determine the sex ofa gorilla from its skulls shape. This finds application in morphometrics and otherbiological sciences.

    Figure1 shows the plot of the preshapes of the k-ads along with the preshapes

    of the sample extrinsic means for the two groups. The sample preshapes have beenrotated appropriately to bring them closest to the chosen preshapes for the means.Figure 2 plots the nonparametric Bayes estimates of the shape densities for thetwo groups along with 95% credible regions. These estimates were obtained usingthe same model, prior and computational algorithm applied in Section 7 for thesimulated data. The plots show the densities conditioned to the geodesic startingfrom the female groups mean shape and directed towards the male groups meanshape.

    To carry out a discriminant analysis, we randomly pick 25 shapes from eachsample as training data sets and the remaining 9 are used as test data. Then weestimate the shape densities independently from the test data for each sex, andfind the conditional probability of being female for each of the test sample shapes.

    If we denote by the prior probability of being female, by f1(m) and f2(m) thefemale and male predictive densities evaluated at a shape m, then the posteriorprobability of being female for the shape m given the training sample of shapes is

    p= f1(m)

    f1(m) + (1 )f2(m).

    We take = 0.5. Table 2 presents the posterior mean p ofp along with a 95%Credible Interval for p for each of the test sample shapes. In this table the first

  • 7/27/2019 Bhattacharya Nonparametric

    21/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 21

    (a)

    (b)

    Figure 1. (a) and (b) show 8 landmarks from skulls of 30 femaleand 29 male gorillas respectively along with the respective samplemean shapes. * correspond to the mean shapes landmarks.

  • 7/27/2019 Bhattacharya Nonparametric

    22/30

    22 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    0.1 0.05 0 0.05 0.1 0.150

    1

    2

    3

    4

    5

    6

    7x 10

    18

    Predictive densities:Female(), Male(..)

    Figure 2. Densities for gorilla shapes

    Table 2. Conditional prob. of being female given the shape anddistances from female & male mean shapes

    Truep CI d(., f) d(., m)Gender

    F 1 (1,1) .041 .1109F .9999 (.9992,1) .0362 .0934F .16 (.008,.602) .056 .0517

    F .9958 (.968, 1) .0495 .0952F 1 (1, 1) .0755 .135M .0001 (0, 0) .1672 .1033M .0005 (0, .003) .087 .0417M .983 (.8197, 1) .0911 .1207M .0003 (0, 0) .1523 .0935

    five shapes correspond to female gorillas while the last four are males. There issome uncertainty in the classification of sample 3 while sample 8 is misclassified.Figure3plots the preshapes of the test samples along with that of the mean shapesfrom the male and female groups.

    We may also build a distance based classifier by comparing the distance of anygiven shape from the mean shapes of the female and male training sets. Columns 4and 5 of table2present the extrinsic distance of each of the test sample shapes fromthe female and male extrinsic means respectively. Using this classifier, samples 3and 8 are misclassified. The disadvantage of using such a classifier is that it isdeterministic in nature - there is no measure for the uncertainity in classifying.

    9. Appendix

  • 7/27/2019 Bhattacharya Nonparametric

    23/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 23

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.4

    Test samples(.), female mean(), male mean(..)1

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.42

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.43

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.44

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.45

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.46

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.47

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.48

    0.5 0 0.50.3

    0.2

    0.1

    0

    0.1

    0.2

    0.3

    0.49

    Figure 3. Preshapes for test samples. Sample (.), Female mean(-), Male Mean (..)

    9.1. Proofs of Theorems2.2 and 2.4. To prove Theorems2.2 and 2.4,wewill need the following lemmas.

    Lemma 9.1. Iff0 is a continuous density onM, then

    (1) for any >0, there exists a (0, Ar1] such thatsupmM

    |f0(m) f(m; F0, )| < .

    (2) Furthermore iff0(m)> 0 m M, then we can choose such thatM

    f0(m)log

    f0(m)

    f(m; F0, )

    V(dm)< .

  • 7/27/2019 Bhattacharya Nonparametric

    24/30

    24 ABHISHEK BHATTACHARYA AND DAVID DUNSON

    Proof. From Proposition2.1,it follows that

    f0(m) = M

    K(; m, )f0(m)V(d).

    In a normal neighborhood ofm, G(m) =Gm() and hence K is symmetric in mand. Also K(m; ., ) is zero outsideB (m, r). Therefore we can write

    (9.1) f(m; F0, ) f0(m) =B(m,r)

    K(m; , ){f0() f0(m)}V(d).

    For convenience let us use f(m) for f(m; F0, ). SinceB(m, r) lies in a normalneighborhood ofm, using normal coordinates into TmM, equation (9.1) simplifiesto

    f(m) f0(m) =dy0, there exists a >0 such that m1, m2 M,dg(m1, m2)< implies that|f0(m1) f0(m2)| < . In equation (9.3),

    f1(y) f1(0) =f0(expm(y)) f0(m).Note thatdg(m, expm(y)) = y A . Hence by choosing < A, we can ensurethat |f1(y) f1(0)| < and hence |f(m) f0(m)| < for allm in M. This proves(1).

    To prove (2), note that c0 = infmMf0(m) is strictly positive. Given any >0, choose in (1) to be such that supmM |f0(m) f(m)| < . TheninfmMf(m)> c

    >0 for sufficiently small. Hence

    M

    f0(m)log

    f0(m)f(m)

    V(dm) sup

    mM

    f0(m)f(m) 1 c0 <

    for sufficiently small. This completes the proof.

    Lemma 9.2. Given >0, if there exists

    (1) a> 0 and aP M(M) such thatsupmM

    |f0(m) f(m; P, )| < 3

    ,

    (2) a setWcontaining and1(W)> 0 such that

    supmM,W

    |f(m; P, ) f(m; P, )| < 3

    ,

    and(3) aW M(M) withP W and1(W)> 0 such thatsup

    mM,PW,W|f(m; P, ) f(m; P, )| <

    3,

    thensupmM

    |f0(m) f(m; P, )| < for all(P, ) W W.

  • 7/27/2019 Bhattacharya Nonparametric

    25/30

    NONPARAMETRIC BAYESIAN DENSITY ESTIMATION ON MANIFOLDS 25

    Proof. Follows from a direct application of the triangular inequality.

    Proof of Theorem2.2. The result follows from Lemma9.2if we can verify

    conditions (1), (2) and (3) because then(1 1){(P, ) : sup

    mM|f0(m) f(m; P, )| < } 1(W)1(W)> 0.

    Condition (1) is verified from Lemma9.1(1) withP= F0. Since 0 supp(1) and1({0}) = 0, we can choose sufficiently small so that supp(1).

    Next we need to find a W for which condition (2) is satisfied. First we show

    that K(m; , ) = dX(dg(m,) )G1 (m) is a continuous function of (m,,) onM M (0, Ar1] (under the product topology). We prove that as follows. Firstlynote that (m, )dg(m, ) is continuous on M M . SinceX is continuous on[0, ), therefore (m,,) X

    dg(m,)

    is continuous on M M (0, ). Also

    since (m, )

    G(m) is a non-zero continuous function on

    {(m, ) M M : dg(m, )< r},therefore G1 (m) is also continuous in the above set. Therefore K(m; , ) iscontinuous on

    {(m, ) M M : dg(m, ) r1} (0, ).SinceK(m; , ) = 0 ifdg(m, ) r1, thereforeKis continuous onMM(0, )and hence uniformly continuous on M M [2 , Ar1] (under the L1 metric) andbounded on this set, say by K. This implies that Kis uniformly equicontinuouson [2 , Ar1]. Hence we can get a compact setW [2 , Ar1] containing in itsinterior such that

    |K(m; , ) K(m; , )| < 3(m,,) M M W.

    Then

    supmM,W|f(m; F0, ) f(m; F0, )| MsupmM,W|K(m; , ) K(m; , )|f0()V(d) supm,M,W|K(m; , ) K(m; , )| < 3 .

    Since supp(1) and Wcontains an open neighborhood of, therefore 1(W)>0. This verifies condition (2).

    Lastly we need to find aWfor which condition (3) is satisfied. We claim thatW= {P : sup

    mM,W|f(m; P, ) f(m; F0, )| <

    3}

    contains a weakly open neighborhood ofF0. To prove this claim, note that for anym

    M,

    W,

    K(m; , ) defines a bounded continuous function on M.

    HenceWm,= {P : |f(m; P, ) f(m; F0, )| <

    9}

    defines a weakly open subset ofM(M) for all (m, ) M W. Now we show that(m, )f(m; P, ) is a uniformly equicontinuous family of functions on M Wlabeled byP M(M). That is because, for m1, m2 M; , W,

    |f(m1; P, ) f(m2; P, )| M

    |K(m1; , ) K(m2; , )|P(d)

  • 7/27/2019 Bhattacharya Nonparametric

    26/30

    26 ABHISHEK BHATTACHARYA AND DAVID DUNSON

and $K$ is uniformly continuous on $M \times M \times W$. Therefore there exists a $\delta > 0$ such that $d_g(m_1, m_2) + |\kappa_1 - \kappa_2| < \delta$ implies that
\[
\sup_{P \in \mathcal{M}(M)} |f(m_1; P, \kappa_1) - f(m_2; P, \kappa_2)| < \frac{\epsilon}{9}.
\]
Cover $M \times W$ by finitely many balls of radius $\delta$: $M \times W = \bigcup_{i=1}^N B((m_i, \kappa_i), \delta)$. Let $\mathcal{W}_1 = \bigcap_{i=1}^N \mathcal{W}_{m_i, \kappa_i}$, which is an open neighborhood of $F_0$. Let $P \in \mathcal{W}_1$ and $(m, \kappa') \in M \times W$. Then there exists a $(m_i, \kappa_i)$ such that $(m, \kappa') \in B((m_i, \kappa_i), \delta)$. Then
\begin{align*}
|f(m; P, \kappa') - f(m; F_0, \kappa')| &\le |f(m; P, \kappa') - f(m_i; P, \kappa_i)| + |f(m_i; P, \kappa_i) - f(m_i; F_0, \kappa_i)| \\
&\qquad + |f(m_i; F_0, \kappa_i) - f(m; F_0, \kappa')| < \frac{\epsilon}{9} + \frac{\epsilon}{9} + \frac{\epsilon}{9} = \frac{\epsilon}{3}.
\end{align*}
Hence $\mathcal{W} \supseteq \mathcal{W}_1$, which proves the claim, and since $F_0 \in \operatorname{supp}(\Pi_1)$, $\Pi_1(\mathcal{W}) \ge \Pi_1(\mathcal{W}_1) > 0$. Hence condition (3) is satisfied. This completes the proof.
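The compactness-plus-equicontinuity device used for condition (2) is easy to see numerically. The following minimal sketch (ours; the von Mises kernel on $S^1$ stands in for $K$, and all tuning values are assumptions) searches for a half-width $h$ such that $W = [\kappa - h, \kappa + h]$ satisfies the kernel bound; because the bound is uniform over $(m, \mu)$, it transfers to $f(\cdot\,; P, \cdot)$ for every mixing distribution $P$ at once.

    import numpy as np
    from scipy.special import i0

    def K(m, mu, kappa):
        # von Mises kernel on S^1; small kappa = small bandwidth (illustrative)
        return np.exp(np.cos(m - mu) / kappa) / (2.0 * np.pi * i0(1.0 / kappa))

    grid = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
    M, MU = np.meshgrid(grid, grid)   # all (m, mu) pairs on a grid
    kappa = 0.2
    for eps in (0.3, 0.1, 0.03):
        # shrink W = [kappa - h, kappa + h] until sup_{m,mu} |K - K'| < eps/3
        h = 0.05
        while max(np.max(np.abs(K(M, MU, kappa) - K(M, MU, kappa + h))),
                  np.max(np.abs(K(M, MU, kappa) - K(M, MU, kappa - h)))) >= eps / 3.0:
            h /= 2.0
        print(eps, h)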

Proof of Theorem 2.4. From Lemma 9.1 it follows that, given any $\epsilon_1 > 0$, we can find a $\kappa_1 > 0$ such that, with $P_1 = F_0 \otimes \delta_{\kappa_1}$,
\[
\sup_{m \in M} |f_0(m) - f(m; P_1)| < \epsilon_1 \quad \text{and} \quad \int_M f_0(m) \log \frac{f_0(m)}{f(m; P_1)} \, V(dm) < \epsilon_1. \tag{9.4}
\]
Hence if we choose $\epsilon_1 \le c_0/2$, where $c_0 = \inf_{m \in M} f_0(m) > 0$, then $\inf_{m \in M} f(m; P_1) \ge c_0/2 \equiv c_1$. Since $F_0 \otimes \delta_0 \in \operatorname{supp}(\Pi_2)$ and $\Pi_2(\{F_0 \otimes \delta_0\}) = 0$, we can choose $\kappa_1$ sufficiently small so that $P_1 \in \operatorname{supp}(\Pi_2)$. Get a compact set $E$ in $(0, \infty)$ containing $\kappa_1$ in its interior. From the proof of Theorem 2.2, it follows that $K(m; \mu, \kappa)$ is continuous, hence uniformly continuous, on $M \times M \times E$. For $P \in \mathcal{M}(M \times (0, A r_1^{-1}])$, define
\[
f(m; P_E) = \int_{M \times E} K(m; \mu, \kappa) \, P(d\mu \, d\kappa).
\]

    Wm(2) = {P : |f(m; PE) f(m; P1)| < 2}defines a weakly open neighborhood ofP1. Since P1 supp(2), therefore 2(Wm(2))>0. We also claim that if

    W=

    {P : sup

    mM|f(m; PE)

    f(m; P1)

    |< 2

    },

    then 2(W)> 0. To see that get 3 > 0 such thatdg(m1, m2)< 3 implies that

    sup(,)ME

    |K(m1; , ) K(m2; , )|


Cover $M$ by finitely many balls of radius $\delta_3$: $M = \bigcup_{i=1}^N B(m_i, \delta_3)$. Then we show that $\mathcal{W} \supseteq \bigcap_{i=1}^N \mathcal{W}_{m_i}(\epsilon_2/3)$. To prove that, pick $P \in \bigcap_{i=1}^N \mathcal{W}_{m_i}(\epsilon_2/3)$. Then
\[
|f(m_i; P_E) - f(m_i; P_1)| < \frac{\epsilon_2}{3}, \quad i = 1, 2, \ldots, N.
\]
Pick $m \in M$, say $m \in B(m_i, \delta_3)$. Equation (9.5) implies that
\[
|f(m; P_E) - f(m_i; P_E)| < \frac{\epsilon_2}{3} \quad \forall P.
\]
Hence
\begin{align*}
|f(m; P_E) - f(m; P_1)| &\le |f(m; P_E) - f(m_i; P_E)| + |f(m_i; P_E) - f(m_i; P_1)| \\
&\qquad + |f(m_i; P_1) - f(m; P_1)| < \epsilon_2.
\end{align*}
This proves the claim; since $\bigcap_{i=1}^N \mathcal{W}_{m_i}(\epsilon_2/3)$ contains a weakly open neighborhood of $P_1 \in \operatorname{supp}(\Pi_2)$, it follows that $\Pi_2(\mathcal{W}) > 0$. For $P \in \mathcal{W}$,
\[
\inf_{m \in M} f(m; P_E) \ge \inf_{m \in M} f(m; P_1) - \epsilon_2 \ge \frac{3 c_1}{4} \quad \text{if } \epsilon_2 < \frac{c_1}{4}.
\]
Then, since $f(m; P) \ge f(m; P_E) \ge f(m; P_1) - \epsilon_2$ for all $m$,
\[
\int_M f_0(m) \log \frac{f(m; P_1)}{f(m; P)} \, V(dm) \le \log\Big(1 + \frac{4 \epsilon_2}{3 c_1}\Big) \le \frac{4 \epsilon_2}{3 c_1}.
\]
Combining this with (9.4), for every $P \in \mathcal{W}$,
\[
\int_M f_0(m) \log \frac{f_0(m)}{f(m; P)} \, V(dm) < \epsilon_1 + \frac{4 \epsilon_2}{3 c_1} < \epsilon
\]
once $\epsilon_1$ and $\epsilon_2$ are chosen sufficiently small, so that $\Pi_2$ assigns positive probability to the $\epsilon$-Kullback-Leibler neighborhood of $f_0$. Since $\epsilon$ was arbitrary, the proof is completed.
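Theorem 2.4 concerns priors that mix over location and bandwidth jointly. As a concrete and entirely illustrative rendering of the object $f(m; P)$ (ours, not the paper's), the following sketch draws $P$ from a truncated stick-breaking Dirichlet process on $S^1 \times (0, \infty)$ — a uniform base measure for locations and a gamma base measure for bandwidths, both our choices — and evaluates the resulting mixture density at a few points:

    import numpy as np
    from scipy.special import i0

    rng = np.random.default_rng(1)
    N, alpha = 50, 1.0                      # truncation level, DP concentration

    v = rng.beta(1.0, alpha, N)             # stick-breaking ratios
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w /= w.sum()                            # renormalise the truncated weights
    mu = rng.uniform(0.0, 2.0 * np.pi, N)   # base measure: uniform locations
    kap = np.clip(rng.gamma(2.0, 0.1, N), 0.02, None)  # floor avoids overflow in i0

    def f(m):
        # f(m; P) = sum_j w_j K(m; mu_j, kappa_j), with a von Mises kernel for K
        return np.sum(w * np.exp(np.cos(m - mu) / kap) / (2.0 * np.pi * i0(1.0 / kap)))

    grid = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
    print([round(float(f(m)), 3) for m in grid])

The degenerate choice $P_1 = F_0 \otimes \delta_{\kappa_1}$ used in the proof corresponds to placing all the bandwidth mass at a single $\kappa_1$.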

9.2. Proof of Lemma 6.1.

Proof. Since $K$ is symmetric in $m$ and $\mu$, we have
\[
\int_{\Sigma_2^k} K(m; \mu, \kappa) \, V(dm) = \int_{\Sigma_2^k} K(m; \mu, \kappa) \, V(d\mu) = 1.
\]
Hence we can write $|f(m; F_0, \kappa) - f_0(m)|$ as
\[
\Big| \int_{\Sigma_2^k} K(m; \mu, \kappa) f_0(\mu) \, V(d\mu) - \int_{\Sigma_2^k} K(m; \mu, \kappa) f_0(m) \, V(d\mu) \Big|
= \Big| \int_{\Sigma_2^k} \{f_0(\mu) - f_0(m)\} K(m; \mu, \kappa) \, V(d\mu) \Big|. \tag{9.7}
\]


Let $x$ and $y$ be preshapes for $m$ and $\mu$ respectively in $CS^{k-2}$, so that $m = [x]$ and $\mu = [y]$. Let $V_1$ denote the volume form on $CS^{k-2}$. Then for any integrable function $\phi : \Sigma_2^k \to \mathbb{R}$,
\[
\int_{\Sigma_2^k} \phi(m) \, V(dm) = \frac{1}{2\pi} \int_{CS^{k-2}} \phi([x]) \, V_1(dx).
\]
Hence the integral in (9.7) can be written as
\[
\frac{c_1^{-1}(\kappa)}{2\pi} \Big| \int_{CS^{k-2}} \{f_0([y]) - f_0([x])\} \exp(\kappa^{-1} y^* x x^* y) \, V_1(dy) \Big|. \tag{9.8}
\]
Consider a singular value decomposition of $xx^*$ as $xx^* = U \Lambda U^*$, where $\Lambda = \operatorname{diag}(1, 0, \ldots, 0)$ and $U = [U_1, \ldots, U_{k-1}]$ with $U_1 = x$. Then
\[
y^* x x^* y = z^* \Lambda z = |z_1|^2, \quad \text{where } z = U^* y = (z_1, \ldots, z_{k-1})'.
\]
Make the change of variable $y \mapsto z$ in (9.8). This does not change the volume form, because $U$ is unitary. Then (9.8) becomes
\[
\frac{c_1^{-1}(\kappa)}{2\pi} \Big| \int_{CS^{k-2}} \{f_0([Uz]) - f_0([x])\} \exp(\kappa^{-1} |z_1|^2) \, V_1(dz) \Big|. \tag{9.9}
\]

Write $z_j = \sqrt{r_j} \, e^{i\theta_j}$, $j = 1, \ldots, k-1$, where $r = (r_1, \ldots, r_{k-1}) \in S_{k-2}$, the unit simplex $\{r : r_j \ge 0, \, \sum_j r_j = 1\}$, and $\theta = (\theta_1, \ldots, \theta_{k-1}) \in [0, 2\pi)^{k-1}$. Then
\[
V_1(dz) = 2^{2-k} \, dr_1 \cdots dr_{k-2} \, d\theta_1 \cdots d\theta_{k-1}.
\]
Hence (9.9) can be written as
\[
c_1^{-1}(\kappa) \, \pi^{-1} 2^{1-k} \Big| \int_{S_{k-2}} \int_{[0,2\pi)^{k-1}} \{f_0([y(r, \theta, x)]) - f_0([x])\} \exp(\kappa^{-1} r_1) \, dr \, d\theta \Big|, \tag{9.10}
\]
where
\[
y \equiv y(r, \theta, x) = \sum_{j=1}^{k-1} \sqrt{r_j} \, e^{i\theta_j} U_j.
\]
Then
\[
d_E^2([y], [x]) = 2(1 - r_1).
\]
By the Hölder continuity of $f_0$, we get that
\[
|f_0([y]) - f_0([x])| \le A (1 - r_1)^{\beta}
\]
for some $A, \beta > 0$. Then from (9.10), we deduce that

\begin{align*}
\sup_{m \in \Sigma_2^k} |f(m; F_0, \kappa) - f_0(m)|
&\le c_1^{-1}(\kappa) \, \pi^{-1} 2^{1-k} A \int_{S_{k-2}} \int_{[0,2\pi)^{k-1}} (1 - r_1)^{\beta} \exp(\kappa^{-1} r_1) \, dr \, d\theta \tag{9.11} \\
&= \frac{\pi^{k-2}}{(k-3)!} \, c_1^{-1}(\kappa) \, A \int_0^1 (1 - r_1)^{\beta + k - 3} \exp(\kappa^{-1} r_1) \, dr_1 \\
&= \frac{\pi^{k-2}}{(k-3)!} \, c_1^{-1}(\kappa) \, \kappa^{k-2+\beta} e^{1/\kappa} A \int_0^{1/\kappa} e^{-s} s^{k-3+\beta} \, ds
\quad \text{(substituting } s = \kappa^{-1}(1 - r_1)\text{)} \\
&\le g(\kappa) \int_0^{\infty} e^{-s} s^{k-3+\beta} \, ds
\end{align*}


with
\[
g(\kappa) = \frac{\pi^{k-2}}{(k-3)!} \, c_1^{-1}(\kappa) \, \kappa^{k-2+\beta} e^{1/\kappa} A.
\]
Hence (9.11) converges to zero if $g(\kappa) \to 0$ as $\kappa \to 0$. Using the expression for $c_1(\kappa)$, $g(\kappa)$ can be written as
\[
g(\kappa) = \frac{A}{(k-3)!} \, \kappa^{\beta} e^{1/\kappa} \Big[ e^{1/\kappa} - \sum_{r=0}^{k-3} \frac{\kappa^{-r}}{r!} \Big]^{-1}
= \frac{A}{(k-3)!} \, \kappa^{\beta} \Big[ 1 - \sum_{r=0}^{k-3} e^{-1/\kappa} \frac{\kappa^{-r}}{r!} \Big]^{-1}.
\]
Since $1 - \sum_{r=0}^{k-3} e^{-1/\kappa} \kappa^{-r}/r! \to 1$ and $\kappa^{\beta} \to 0$ as $\kappa \to 0$, therefore $g(\kappa) \to 0$, and this completes the proof.
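As an optional numerical cross-check (ours, not part of the proof): equating the two displayed forms of $g(\kappa)$ gives $c_1(\kappa) = \pi^{k-2} \kappa^{k-2} \big(e^{1/\kappa} - \sum_{r=0}^{k-3} \kappa^{-r}/r!\big)$, and with this constant the kernel $c_1^{-1}(\kappa) \exp(\kappa^{-1} |x^* y|^2)$ should integrate to one over $\Sigma_2^k$. Equivalently, since the volume of $\Sigma_2^k$ is $\pi^{k-2}/(k-2)!$, for $[y]$ uniform on the shape space one should have $E[\exp(\kappa^{-1} |x^* y|^2)] = c_1(\kappa) (k-2)!/\pi^{k-2}$. A short Monte Carlo check in Python:

    import numpy as np
    from math import factorial

    # Monte Carlo check of the complex-Watson normalising constant implied above.
    rng = np.random.default_rng(0)
    k, kappa, n = 6, 0.5, 200_000           # k landmarks, bandwidth, MC size

    x = rng.standard_normal(k - 1) + 1j * rng.standard_normal(k - 1)
    x /= np.linalg.norm(x)                   # a fixed preshape in CS^{k-2}

    y = rng.standard_normal((n, k - 1)) + 1j * rng.standard_normal((n, k - 1))
    y /= np.linalg.norm(y, axis=1, keepdims=True)  # uniform draws on CS^{k-2}

    mc = np.mean(np.exp(np.abs(y @ x.conj()) ** 2 / kappa))
    theory = factorial(k - 2) * kappa ** (k - 2) * (
        np.exp(1.0 / kappa) - sum(kappa ** (-r) / factorial(r) for r in range(k - 2)))
    print(mc, theory)   # should agree up to MC error (about 1.58 for these settings)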


    Department of Statistical Science, Duke University, Durham, NC, USA

