Stat Comput (2009) 19: 367–380
DOI 10.1007/s11222-008-9098-3

On projection-based tests for directional and compositional data

Juan A. Cuesta-Albertos · Antonio Cuevas · Ricardo Fraiman

Received: 13 March 2008 / Accepted: 3 September 2008 / Published online: 25 September 2008
© Springer Science+Business Media, LLC 2008

Abstract A new class of nonparametric tests, based on random projections, is proposed. They can be used for several null hypotheses of practical interest, including uniformity for spherical (directional) and compositional data, sphericity of the underlying distribution and homogeneity in two-sample problems on the sphere or the simplex.

The proposed procedures have a number of advantages, mostly associated with their flexibility (for example, they also work to test “partial uniformity” in a subset of the sphere), computational simplicity and ease of application even in high-dimensional cases.

This paper includes some theoretical results concerning the behaviour of these tests, as well as a simulation study and a detailed discussion of a real data problem in astronomy.

Keywords Compositional data · Directional data · Sphericity · Uniformity

J.A. Cuesta-Albertos (✉)
Departamento de Matemáticas, Estadística y Computación, Univ. de Cantabria, Santander, Spain
e-mail: [email protected]

A. Cuevas
Departamento de Matemáticas, Univ. Autónoma de Madrid, Madrid, Spain

R. Fraiman
Departamento de Matemática, Univ. de San Andrés, Buenos Aires, Argentina

R. Fraiman
Centro de Matemática, Univ. de la República, Montevideo, Uruguay

1 Introduction

1.1 Uniformity tests for directional data

According to the usual terminology, directional data are those whose sample space is the unit sphere, $S^{d-1}$, of the space $\mathbb{R}^d$ endowed with the usual Euclidean norm, that is $S^{d-1} = \{x : \|x\|_2 = 1\}$, where $\|x\|_2 = \left(\sum_{i=1}^{d} x_i^2\right)^{1/2}$, with $x = (x_1, \ldots, x_d)^t$.

The books by Mardia and Jupp (2000) and Fisher (1993) are classical references on the subject. Of course the cases $d = 2$ (circular data) and $d = 3$ (spherical data) are particularly important. As Mardia and Jupp (2000, p. 1) point out, circular data often come from the compass (e.g., wind directions) or the clock (e.g., arrival times of patients at a medical service). Also, astronomy is a continual source of spherical data: an example is discussed in Sect. 5 below.

The study of directional data requires a special statistical theory, in some sense “parallel” to the classical one, adapted in order to cope with the fact that in directional data the sample space is a manifold that differs in many respects from the Euclidean space. Thus, some basic concepts such as mean, variance and density estimators must be carefully redefined. Likewise, most standard models of multivariate statistics, in particular the Gaussian model, are not straightforwardly extended to the directional case. However, we will be mainly concerned here with the uniform distribution, whose definition in the directional case poses no particular difficulty. It is just the probability distribution with constant density on the sphere. More precisely, in this paper we propose a new procedure for the usual task of testing uniformity, which in many cases arises as a first natural step in the statistical analysis, before going further in the search for data structure.

Two classical procedures for testing uniformity in directional data are Kuiper’s test (e.g., Mardia and Jupp 2000, p. 99), valid for the case $d = 2$, and Rayleigh’s test, which is asymptotic but can be applied to directional samples of any dimension (e.g., Mardia and Jupp 2000, p. 99 and p. 207). The first one is a “universal test”, consistent against all alternatives; it relies on ideas similar to those of the Kolmogorov-Smirnov (K-S) procedure. Rayleigh’s test is rather designed to detect unimodal alternatives (Fisher 1993, p. 69) and it is not universally consistent. In fact, it is the likelihood ratio test for testing a simple hypothesis within the von Mises family. It can be adapted to the case in which the mean direction of the alternative distribution is given in advance.

Another more recent proposal, which will also be considered in the comparisons below, is Giné’s (1975) test. It is based on an asymptotic approximation but it is universal and can be applied for any dimension $d$; see Mardia and Jupp (2000, p. 209).

1.2 A new proposal based on random projections

We present here a new family of uniformity tests for directional data based on the use of projections along random directions, as proposed in Cuesta-Albertos et al. (2007b) (CFR07, in the sequel). Roughly speaking, in that paper it is shown that the distribution of a random element (even taking values in an infinite-dimensional space) is determined by that of a one-dimensional projection along just one randomly chosen direction. For a precise statement see Theorem 4.1 in CFR07. Usually this approach leads to a conditional test, in the sense that the distribution of the projections under the null hypothesis depends on the chosen direction. The situation is simpler in the problem of testing uniformity for directional data since, as we will see, the distribution of the projections under the null hypothesis of uniformity does not depend on the obtained direction and admits a simple closed-form expression. Thus, in this case, we have an unconditional test.

Let us emphasize, however, that the tests based on random projections are randomized, in the sense that the final decision (acceptance or rejection) for a given sample could vary depending on the chosen direction, even if all the projection distributions coincide. This is an intrinsic feature of the random projections methodology; see Cuesta-Albertos et al. (2007a) for an application to goodness of fit. The incorporation of external randomness is not unusual in statistics. It arises, for example, when bootstrap-based tests are used and the null distribution of the bootstrap statistic is approximated by resampling. In a more classical setting, randomized tests arise even in the Neyman-Pearson lemma in order to guarantee that the desired significance level is in fact attained. Anyway, the effect of the external randomness in our projection procedures can be reduced by taking several independent random directions and combining the individual results. In the case of hypothesis testing the simplest way to do this is the classical Bonferroni device, but other more sophisticated approaches are sometimes possible. For example, in the problem of testing uniformity for directional data considered in this paper, we have combined several individual tests by using the minimum p-value as a test statistic, following the ideas in Berk and Jones (1978). As we will show, the combination of several tests greatly outperforms the simplest version relying on just one random direction.

Another way to avoid the presence of randomness is considered in Cuevas and Fraiman (2008), who study the use (in estimation and classification problems) of a depth based on the global average of the one-dimensional depth measures calculated over all possible directions.

Thus the random projection methodology is still far from being a closed proposal. In this work we will try to show that it is also useful in a number of problems, besides testing uniformity, especially when we have to deal with high-dimensional data. In particular, we will discuss how to use it in order to test the sphericity of the underlying distribution around a given point as well as to test the homogeneity (i.e., the coincidence) of two distributions in a two-sample problem with directional data. Still, the paper is mainly oriented to the uniformity problem, just to focus the discussion on a situation where the null distributions of the test statistics are explicitly known and particularly simple.

While it is clear that high-dimensional directional data appear less frequently in practice than those on the circle or the unit sphere, they are also interesting in some practical situations; see for example Juan and Prieto (2001), where a test of uniformity for $S^{d-1}$-valued data is proposed in connection with a problem of outlier detection.

1.3 On directional and compositional data

A further application of high-dimensional data on the sphere arises in the study of problems involving proportions, that is, those situations in which the observed variable is of type $x = (x_1, \ldots, x_d)^t \in \mathbb{R}^d$, with $x_j \ge 0$ and $\sum x_j = 1$, $x_j$ being, for example, the proportion of time devoted to the $j$-th activity by a worker or the proportion of the $j$-th component in the soil. These are the so-called “compositional data”, which have received considerable attention, especially motivated by their applications in geology and chemistry; see Aitchison and Egozcue (2005) for a recent survey. Stephens (1982) analyzes a direct connection between directional and compositional data given by the change of variable $\sqrt{x_j} = \xi_j$, which takes the vector of proportions $(x_1, \ldots, x_d)$ to a point $(\xi_1, \ldots, \xi_d)$ in the “positive part” of $S^{d-1}$. We explore here a different connection, based on the simple observation that the compositional data are just elements of the positive quadrant in the unit sphere of the space $\mathbb{R}^d$ endowed with the $L_1$ norm, $\|x\|_1 = \sum |x_j|$. Hence, in a way, compositional data are also directional when viewed in the appropriate space. We will show below that this connection is in fact deeper than suggested by the formal replacement of $\|\cdot\|_2$ with $\|\cdot\|_1$: our main results, Theorems 2.2 and 4.2 (which provide the support for projection-based tests of uniformity and homogeneity), also cover the case of compositional data.
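The two views of compositional data described above can be illustrated with a few lines of Python. This is only a minimal sketch: the vector `composition` and the helper names are hypothetical and are not taken from the paper.

```python
import math


def l1_norm(x):
    """L1 norm: sum of absolute values."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


def sqrt_transform(x):
    """Change of variable sqrt(x_j) = xi_j attributed to Stephens (1982) in the text."""
    return [math.sqrt(v) for v in x]


# Hypothetical vector of proportions (nonnegative, summing to one).
composition = [0.5, 0.3, 0.2]

# View 1: the composition already lies on the unit sphere of the L1 norm.
print(l1_norm(composition))

# View 2: after the square-root transform it lies on the Euclidean unit sphere.
xi = sqrt_transform(composition)
print(l2_norm(xi))
```

Both printed norms should equal one up to rounding, which is all that the two connections between compositional and directional data require.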

1.4 Organization of the paper and notation

Our uniformity projection tests are introduced in Sect. 2. A simulation study, devoted only to the problem of testing uniformity on $S^1$ and $S^2$, appears in Sect. 3: different versions of the projection tests are compared with Kuiper’s, Rayleigh’s and Giné’s tests. Sphericity and homogeneity tests are studied in Sect. 4. A real data problem concerning the uniformity of the cometary orbits is discussed in Sect. 5. All the proofs are deferred to Sect. 6, which is a technical appendix.

We will assume that all the random variables are defined on the same rich enough probability space $(\Omega, \sigma, P)$. Given the random vectors $X$ and $Y$, $P_X$ will denote the distribution generated by $X$, the expression $X \sim Y$ will mean that $P_X = P_Y$, and $P[X \in B \mid Y]$ will denote the conditional probability that $[X \in B]$ given $Y$.

2 The uniformity random projection tests for directional data

In this section we present the theoretical developments on which our procedures are based.

2.1 The random projection tests based on a unique direction (RP1): Theoretical basis

Let $X_1, \ldots, X_n$ be a random sample of $S^{d-1}$-valued random variables. Denote by $U$ another random variable, independent from the $X_i$’s and uniformly distributed on $S^{d-1}$. Let $Y_1 = X_1^t U, \ldots, Y_n = X_n^t U$ be the one-dimensional projections of the directional data $X_i$ on the random direction $U$.

As we will see, the distribution of the random variables $Y_i$ uniquely determines, with probability one (in a sense to be specified in the first part of Theorem 2.2), the distribution of $X_1$. Thus, almost surely, a consistent test for the null hypothesis

$H_0$: The distribution of $X_1$ is uniform on $S^{d-1}$

would be obtained by just testing

$H_0^*$: The distribution function of $Y_1$ is $F_0$,

where $F_0$ denotes the distribution function of the projection $Y_1 = X_1^t U$ under $H_0$. The hypothesis $H_0^*$ could be tested using a suitable goodness-of-fit test such as, for example, the classical, universally consistent, Kolmogorov-Smirnov procedure.

Thus, our first proposal to test $H_0$ can be summarized as follows:

(a) Given the directional data (on $S^{d-1}$) $X_1, \ldots, X_n$, select at random (uniformly) a direction $U$ in $S^{d-1}$. A simple way to obtain $U$ with a standard software package is to take $U = Z/\|Z\|$, $Z$ being a $d$-dimensional standard Gaussian random vector.

(b) Compute the projections $Y_1 = X_1^t U, \ldots, Y_n = X_n^t U$.

(c) Compute the empirical distribution $F_n$ of the $Y_i$ and the K-S statistic

$$D_n = \sup_x |F_n(x) - F_0(x)|.$$

(d) Reject $H_0^*$, and consequently $H_0$, at a significance level $\alpha$, whenever $D_n > C_{n,\alpha}$, where $C_{n,\alpha}$ denotes the appropriate critical value extracted from the distribution of the K-S statistic.
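Steps (a)-(d) can be sketched in Python using only the standard library. This is an illustrative sketch, not the authors' implementation (which the paper says was written in R). It covers only $d = 2$ and $d = 3$, using the closed forms of $F_0$ given in Proposition 2.1. As an assumption of the sketch, the critical value $C_{n,\alpha}$ is approximated by simulating the K-S statistic for uniform samples on $[0, 1]$ instead of being read from tables. All function names are hypothetical.

```python
import math
import random


def random_direction(d, rng):
    """Step (a): U = Z / ||Z|| with Z a d-dimensional standard Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def f0(t, d):
    """Null distribution function of a projection (Proposition 2.1, d = 2 or 3)."""
    t = max(-1.0, min(1.0, t))  # guard against rounding slightly outside [-1, 1]
    if d == 2:
        return (math.pi - math.acos(t)) / math.pi
    if d == 3:
        return (t + 1.0) / 2.0
    raise ValueError("this sketch only covers d = 2 and d = 3")


def ks_statistic(values, cdf):
    """Step (c): D_n = sup_x |F_n(x) - F_0(x)| for a continuous F_0."""
    ys = sorted(values)
    n = len(ys)
    d_n = 0.0
    for i, y in enumerate(ys):
        f = cdf(y)
        d_n = max(d_n, (i + 1) / n - f, f - i / n)
    return d_n


def ks_critical_value(n, alpha, rng, n_sim=2000):
    """Monte Carlo stand-in for C_{n,alpha} (an assumption of this sketch)."""
    sims = sorted(
        ks_statistic([rng.random() for _ in range(n)], lambda u: u)
        for _ in range(n_sim)
    )
    index = min(n_sim - 1, math.ceil((1.0 - alpha) * n_sim) - 1)
    return sims[index]


def rp1_test(sample, alpha=0.05, rng=None):
    """Return (D_n, critical value, reject?) for one random direction."""
    rng = rng or random.Random()
    d = len(sample[0])
    u = random_direction(d, rng)
    ys = [sum(a * b for a, b in zip(x, u)) for x in sample]  # step (b)
    d_n = ks_statistic(ys, lambda t: f0(t, d))
    c = ks_critical_value(len(sample), alpha, rng)
    return d_n, c, d_n > c  # step (d)


# Hypothetical, clearly non-uniform circular sample: 50 points on a short arc.
rng = random.Random(1)
clustered = [[math.cos(0.01 * i), math.sin(0.01 * i)] for i in range(50)]
d_n, c, reject = rp1_test(clustered, rng=rng)
print(d_n, c, reject)
```

For a sample this concentrated, the projections fall in a short interval whatever the direction, so the test should reject. For less extreme alternatives the decision can change from one random direction to another, which is the randomization effect discussed in Sect. 1.2.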

We will call this procedure the “RP1 test” (i.e., Random Projection test based on a unique direction). It can be seen as a particular case of a general methodology of (multivariate and functional) inference based on projections: see Cuesta-Albertos et al. (2006, 2007a).

The distribution $F_0$ can be calculated explicitly (see Juan and Prieto 2001) or can be easily approximated, with arbitrary precision, by drawing very large uniform samples on $S^{d-1}$. However, in the cases $d = 2$ and $d = 3$, the distribution function $F_0$ has a particularly simple form which eases the computations considerably, as shown in the following proposition. This proposition includes some known results on the distributions of the projections which can be checked by direct calculation. Further results about projections and uniform distributions on the sphere appear in Brown et al. (1986).

Proposition 2.1 (1) In the case $d = 2$ the distribution function $F_0$ is

$$F_0(t) = \frac{\pi - \operatorname{acos}(t)}{\pi}, \quad \text{for all } t \in (-1, 1),$$

where $\operatorname{acos}(t)$ denotes the only value $\theta \in [0, \pi]$ whose cosine is $t$.

(2) In the case $d = 3$ the distribution $F_0$ is uniform in the interval $[-1, 1]$.
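Both statements of Proposition 2.1 can be checked numerically by the Monte Carlo approach mentioned above (drawing large uniform samples on the sphere). The following Python sketch does this; the simulation size, the seed and the evaluation points are arbitrary choices, and the function names are hypothetical.

```python
import math
import random


def uniform_on_sphere(d, rng):
    """Uniform point on S^{d-1}, obtained by normalizing a standard Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def f0_circle(t):
    """Proposition 2.1 (1): d = 2."""
    return (math.pi - math.acos(t)) / math.pi


def f0_sphere(t):
    """Proposition 2.1 (2): d = 3, uniform distribution on [-1, 1]."""
    return (t + 1.0) / 2.0


def empirical_f0(d, u, t, n_sim, rng):
    """Proportion of simulated uniform points whose projection on u is <= t."""
    hits = 0
    for _ in range(n_sim):
        x = uniform_on_sphere(d, rng)
        if sum(a * b for a, b in zip(x, u)) <= t:
            hits += 1
    return hits / n_sim


rng = random.Random(2009)
u2 = uniform_on_sphere(2, rng)  # the direction itself should not matter under H0
u3 = uniform_on_sphere(3, rng)
checks = []
for t in (-0.5, 0.0, 0.7):
    checks.append((empirical_f0(2, u2, t, 20000, rng), f0_circle(t)))
    checks.append((empirical_f0(3, u3, t, 20000, rng), f0_sphere(t)))
for simulated, exact in checks:
    print(round(simulated, 3), round(exact, 3))
```

The simulated and exact values should agree up to Monte Carlo error, here of the order of a few thousandths.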

Our first theoretical result is the following theorem, whose first statement provides the basis for the proposed uniformity test. The second statement extends the result to compositional data.

Theorem 2.2 (1) Let $X$ be an $S^{d-1}$-valued random variable and $U$ be a random variable, independent from $X$, with uniform distribution on $S^{d-1}$.

With probability one, the distribution of $X$ is uniquely determined by the distribution of the projection of $X$ on $U$, in the sense that, if $X_1$ and $X_2$ are $S^{d-1}$-valued random variables, independent from $U$, and the probability that $U$ belongs to the set $\{u : X_1^t u \sim X_2^t u\}$ is positive, then $X_1$ and $X_2$ are identically distributed.

(2) The above statement is also true if the unit sphere $S^{d-1}$ corresponding to the Euclidean norm is replaced with the unit $\|\cdot\|_1$-sphere $S_1^{d-1}$.

Remark 2.2.1 Let us emphasize the importance of the randomness in Theorem 2.2. Clearly, if $X_1$ and $X_2$ are two $S^{d-1}$-valued random variables such that, for a fixed direction $u_0$ (for instance, $u_0 = (1, 0, \ldots, 0)^t$), $X_1^t u_0$ and $X_2^t u_0$ have the same distribution, there is no guarantee that $X_1$ and $X_2$ are identically distributed. The interest of Theorem 2.2 lies in the fact that, if we choose $u_0$ at random, we can be sure that $X_1$ and $X_2$ are identically distributed whenever $X_1^t u_0$ and $X_2^t u_0$ have the same distribution.

Remark 2.2.2 It is not essential that the distribution of $U$ is uniform. Other possible continuous distributions with support on $S^{d-1}$ could do the same job. However, we choose the uniform distribution for obvious reasons of simplicity.

Remark 2.2.3 The RP1 procedure (and all the extensions considered below) can also be adapted, with obvious modifications, to test hypotheses of uniformity on some subsets of $S^{d-1}$. For example, if the support of $X$ is the “Northern” (upper) hemisphere $S_N^{d-1}$, and we want to test the uniformity of $X$ in $S_N^{d-1}$, we may perform the RP1 test evaluating the distribution of a random projection under $H_0$. This evaluation can be carried out by Monte Carlo simulation or by direct calculation, but taking into account that, in general, the distribution of the projection will depend on the chosen direction.

In a similar way we can test the hypothesis of homogeneity (equal distribution) of two $S^{d-1}$-valued random variables (see Sect. 4).
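The Monte Carlo evaluation mentioned in Remark 2.2.3 can be sketched as follows for the upper hemisphere. This is an illustration under stated assumptions, not the authors' code: uniform points on the hemisphere are obtained by reflecting the last coordinate of uniform points on the whole sphere, and the null distribution function is approximated separately for each direction. The function names and the two directions are hypothetical.

```python
import bisect
import math
import random


def uniform_on_upper_hemisphere(d, rng):
    """Uniform point on the upper hemisphere: reflect a uniform point of S^{d-1}."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    x = [v / norm for v in z]
    x[-1] = abs(x[-1])
    return x


def null_projections(u, n_sim, rng):
    """Sorted projections on u of n_sim uniform points of the upper hemisphere."""
    d = len(u)
    return sorted(
        sum(a * b for a, b in zip(uniform_on_upper_hemisphere(d, rng), u))
        for _ in range(n_sim)
    )


def approx_f0(sorted_projections, t):
    """Monte Carlo approximation of the null distribution function at t."""
    return bisect.bisect_right(sorted_projections, t) / len(sorted_projections)


rng = random.Random(7)
pole = [0.0, 0.0, 1.0]     # direction of the "North pole"
equator = [1.0, 0.0, 0.0]  # a direction on the boundary of the hemisphere
proj_pole = null_projections(pole, 5000, rng)
proj_equator = null_projections(equator, 5000, rng)

# The null distribution now depends on the direction: along the pole all
# projections are nonnegative, along the equator they are symmetric around 0.
print(approx_f0(proj_pole, -0.01), approx_f0(proj_equator, -0.01))
print(approx_f0(proj_pole, 0.5), approx_f0(proj_equator, 0.0))
```

The approximated distribution function would then replace $F_0$ in the K-S statistic, conditionally on the chosen direction.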

2.2 The random projection tests based on $k$ directions (RPk)

The projection direction $U$ in the RP1 test is randomly chosen, which suffices to perfectly identify the distribution of $X$ with probability one and obtain a consistent test. However, under the alternative, this also entails the risk of getting an unfortunate direction $u_0$ where there is not much difference between the distribution of the projection of $X$ along $u_0$ and $F_0$. Thus, with a certain positive probability, we can have a test of low power (conditional on $U = u_0$). A natural thing to do in this respect is to consider several independent random directions $U_1, \ldots, U_k$ along which the original data could be successively projected. Then, we could perform the RP1 test for each of these directions and combine the results appropriately to get an overall test. More precisely, denote by $P_1, \ldots, P_k$ the p-values of the RP1 tests performed with the original data projected along the directions $U_1, \ldots, U_k$, respectively. As these tests are not in general independent, we could adopt the classical Bonferroni procedure. This would lead to reject $H_0$ at a level $\alpha$ whenever $\min_j P_j \le \alpha/k$.

However, it is well known that the Bonferroni procedure is quite conservative. An alternative is to use $\tau_n = \min_j P_j$ itself as a test statistic, provided that its null distribution is known or can be numerically approximated. According to Berk and Jones (1978), this test enjoys some optimality properties. In our case we have used Monte Carlo simulations to approximate its conditional distribution (given the obtained directions $U_1, \ldots, U_k$) under the null hypothesis.

We will call this method the “RPk test”.
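The Monte Carlo calibration of the RPk test can be sketched in Python as follows, for $d = 2$ or $d = 3$. The sketch relies on one simplifying assumption that is not stated in the paper: if every $P_j$ is computed from the same null distribution of the K-S statistic $D_n$ (same $n$, same $F_0$), then $\min_j P_j$ is a decreasing function of $\max_j D_n(U_j)$, so the largest K-S statistic is calibrated instead of the smallest p-value. Function names, the number of simulations and the example data are hypothetical.

```python
import math
import random


def uniform_on_sphere(d, rng):
    """Uniform point on S^{d-1} (also used to draw the random directions)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def f0(t, d):
    """Null distribution function of a projection (Proposition 2.1, d = 2 or 3)."""
    t = max(-1.0, min(1.0, t))
    if d == 2:
        return (math.pi - math.acos(t)) / math.pi
    if d == 3:
        return (t + 1.0) / 2.0
    raise ValueError("this sketch only covers d = 2 and d = 3")


def ks_statistic(values, cdf):
    """D_n = sup_x |F_n(x) - F_0(x)| for a continuous F_0."""
    ys = sorted(values)
    n = len(ys)
    return max(
        max((i + 1) / n - cdf(y), cdf(y) - i / n) for i, y in enumerate(ys)
    )


def max_ks(sample, directions):
    """Largest K-S statistic over the given directions."""
    d = len(sample[0])
    return max(
        ks_statistic(
            [sum(a * b for a, b in zip(x, u)) for x in sample],
            lambda t: f0(t, d),
        )
        for u in directions
    )


def rpk_test(sample, k, n_sim=300, rng=None):
    """Monte Carlo p-value of the RPk test, conditional on the k drawn directions."""
    rng = rng or random.Random()
    n, d = len(sample), len(sample[0])
    directions = [uniform_on_sphere(d, rng) for _ in range(k)]
    observed = max_ks(sample, directions)
    exceed = 0
    for _ in range(n_sim):
        null_sample = [uniform_on_sphere(d, rng) for _ in range(n)]
        if max_ks(null_sample, directions) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)


# Hypothetical non-uniform circular sample: 50 points on a short arc.
rng = random.Random(3)
clustered = [[math.cos(0.01 * i), math.sin(0.01 * i)] for i in range(50)]
p_value = rpk_test(clustered, k=5, rng=rng)
print(p_value)
```

The hypothesis $H_0$ would be rejected at level $\alpha$ when the returned p-value is at most $\alpha$. The Bonferroni variant would instead compare each individual p-value with $\alpha/k$.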

Remark 2.2.4 A possibility suggested by an anonymous referee is to consider the statistic

$$M = \sup\{|F_n(t \mid u) - F_0(t)| : u \in S^{d-1},\ t \in [-1, 1]\},$$

where $F_n(\cdot \mid u)$ is the empirical distribution function of the data projected onto $u$ and $F_0$ is the cumulative distribution function of the projections under the null hypothesis. This can be related to the RP methodology through the sequence of statistics

$$M_k = M(U_1, \ldots, U_k) = \sup\{|F_n(t \mid U_i) - F_0(t)| : i = 1, \ldots, k,\ t \in [-1, 1]\}.$$

Indeed, it can be shown that $M_k$ is an increasing sequence which converges a.s. to $M$ as $k \to \infty$ (see the technical appendix for a proof). It can also be noted that the statistic $M$ is related to the projection pursuit method which, in general, is computationally quite expensive. A comparison between this kind of statistic and the random projection method in a problem of goodness of fit to a bivariate Gaussian distribution can be seen in Sect. 5.1 in Cuesta-Albertos et al. (2007a). These authors obtain similar powers using $k = 40$ randomly chosen directions and the projection pursuit method approximated through 15,000 projections.

Remark 2.2.5 A further version of our procedure would arise by projecting on a number $k$ (given in advance) of directions chosen in a “random-systematic” way, as follows: the first direction $u_1$ is randomly chosen but the remaining ones are systematically selected from $u_1$ in order to cover a wide range of different directions to project, thus benefiting from the consistency provided by the random choice but avoiding the possible “bad luck” in the selection of redundant directions. We have checked, by simulation, different versions of this mixed, random-systematic procedure (for $d = 2, 3$) but the results were inferior to those obtained with the RPk methods based on a large number $k$ of projections ($k = 25$ for $d = 2$; $k = 50, 100$ for $d = 3$). Thus we do not include them here in order to keep the focus on the main ideas of the random projections methodology.

3 A simulation study

In this section we will study the performance of the different tests RPk through a simulation study, restricting ourselves to the most common cases $d = 2$ and $d = 3$. The high-dimensional situation $d > 3$ will be considered and motivated in Sect. 4, in connection with the problem of testing sphericity and homogeneity.

The details of the simulation study are as follows:

Tests to be compared In all cases, the aim is to check the null hypothesis of uniformity for the underlying distribution which generates the data. In the case of circular data ($d = 2$) we compare the random projection tests RP1, RP2, RP3, RP5, RP10, and RP25 and the classical Kuiper’s and Rayleigh’s tests mentioned in the introduction. For Kuiper’s test we have used the implementation included in the R package Circular, developed by Lund and Agostinelli (2006).

As for Rayleigh’s test, we use the improved version of this test described in Mardia and Jupp (2000, p. 207): the test statistic is

$$\left(1 - \frac{1}{2n}\right) T + \frac{1}{2n(d+2)}\, T^2,$$

where $T = dn\|\bar{X}\|_2^2$, $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. The asymptotic distribution of this statistic under the uniform distribution is $\chi^2_d$ and the approximation error is of order $O(n^{-2})$.

For spherical data ($d = 3$) we employ the random projection tests RP1, RP5, RP10, RP25, RP50 and RP100. These are compared, again, with the Rayleigh test but Kuiper’s test (which is not easily extended to data on the sphere $S^{d-1}$ with $d > 2$) is in this case replaced with Giné’s test. This test can be used in different versions. We have chosen the implementation based on the statistic $F_n$ defined in Mardia and Jupp (2000, p. 209).
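The improved Rayleigh statistic given above is simple to compute. The following Python sketch implements it together with the $\chi^2_d$ survival function for $d = 2$ and $d = 3$, which has a closed form in those two cases. The function names are hypothetical, and the p-value is only the asymptotic approximation described in the text.

```python
import math


def modified_rayleigh(sample):
    """(1 - 1/(2n)) T + T^2 / (2n(d + 2)), with T = d n ||mean of the sample||^2."""
    n = len(sample)
    d = len(sample[0])
    mean = [sum(x[j] for x in sample) / n for j in range(d)]
    t = d * n * sum(m * m for m in mean)
    return (1.0 - 1.0 / (2 * n)) * t + t * t / (2 * n * (d + 2))


def chi2_sf(x, d):
    """P(chi^2_d > x) for d = 2 and d = 3 only (closed forms)."""
    if d == 2:
        return math.exp(-x / 2.0)
    if d == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("this sketch only covers d = 2 and d = 3")


def rayleigh_pvalue(sample):
    """Asymptotic p-value of the improved Rayleigh test."""
    return chi2_sf(modified_rayleigh(sample), len(sample[0]))


# Two hypothetical circular samples: perfectly balanced, and fully concentrated.
balanced = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
concentrated = [[1.0, 0.0]] * 10
print(modified_rayleigh(balanced), rayleigh_pvalue(balanced))
print(modified_rayleigh(concentrated), rayleigh_pvalue(concentrated))
```

The balanced sample has zero mean vector, so the statistic vanishes whatever the shape of the sample. This illustrates why a test based on the sample mean cannot detect alternatives that are symmetric around the origin.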

In the choice of the tests for comparison, the idea is to assess in each case ($d = 2, 3$) the performance of our RPk tests against two of the most popular “classical” competitors. For $d = 2$ the choice of Kuiper’s test seems quite natural, in view of its universal consistency and relative performance when compared with other tests; see e.g. Mardia and Jupp (2000, p. 115). Giné’s procedure plays a similar role in the high-dimensional cases. Finally, the choice of Rayleigh’s test looks quite natural, as it is maybe the most popular uniformity test in circular statistics and can be used in any dimension. Moreover, it enjoys interesting optimality properties against von Mises alternatives (see Mardia and Jupp 2000, pp. 95–96) as well as good practical performance (Figueiredo 2007).

Underlying distributions All the considered tests are used under the null hypothesis and under several alternative distributions, namely:

1. The model M1: This is a sort of location projected normal model, which can be seen as a particular case of the so-called angular Gaussian or offset normal; see Mardia and Jupp (2000, p. 46). The distributions in this model are indexed by a real parameter $b$. In the case $d = 2$ the data are obtained by projecting on the unit circumference $S^1$ observations drawn from the bivariate Gaussian distribution $N_2((b, b)^t, I_2)$, where $I_d$ denotes the $d$-dimensional identity matrix. In other words, if a random variable $Z_b$ is $N_2((b, b)^t, I_2)$, then the distribution of $Z_b/\|Z_b\|$ is M1($b$).

Analogously, in the three-dimensional case the data are obtained by projecting on the unit sphere $S^2$ observations drawn from the Gaussian distribution $N_3((b, b, b)^t, I_3)$.

2. The model M2: This is another sub-model (a sort of scale normal projected model) of the offset normal. In this case, the data are generated by projecting on the unit circumference the random points drawn from the standard bivariate Gaussian distribution multiplied by the matrix $B_b = \begin{pmatrix} 1 & b \\ b & 1 \end{pmatrix}$, where, again, $b$ is a real parameter. This amounts to projecting on $S^1$ the random observations generated from a centered Gaussian distribution with $b^2 I_2 + B_{2b}$ as covariance matrix.

In the $d$-dimensional case (with $d \ge 3$) the definition of model M2 is analogous: the matrix $B_b$ is replaced with another $d \times d$ matrix with 1 in the main diagonal and $b$ outside. We are now concerned with the cases $d = 2, 3$ but in Sect. 4 below we will also use the model M2 in high-dimensional problems with $d > 3$.

3. The von Mises model: This is maybe the most popular model in directional statistics. The von Mises density on $S^1$ and $S^2$ (see, e.g., Mardia and Jupp 2000, pp. 36 and 167) depends on a mean direction $\mu$ ($= 0$ and $(1, 0)^t$ in the simulations below for $d = 2$ and $d = 3$, respectively) and a concentration parameter $\kappa > 0$. The uniform distribution appears as a limit as $\kappa \to 0$.
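The three families of alternatives can be simulated with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the von Mises generator covers only the circular case $d = 2$ (through `random.vonmisesvariate` from the standard library), and the spherical von Mises case, for which the paper uses the Ulrich-Wood algorithm, is not covered.

```python
import math
import random


def project_to_sphere(z):
    """Map a nonzero vector z to z / ||z||."""
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]


def sample_m1(b, d, rng):
    """M1(b): project an N_d((b, ..., b)^t, I_d) observation onto the unit sphere."""
    return project_to_sphere([rng.gauss(b, 1.0) for _ in range(d)])


def sample_m2(b, d, rng):
    """M2(b): project B_b Z onto the unit sphere, Z being standard Gaussian."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    total = sum(z)
    # Row i of B_b has 1 on the diagonal and b elsewhere.
    return project_to_sphere([zi + b * (total - zi) for zi in z])


def sample_von_mises_circle(kappa, rng, mu=0.0):
    """Von Mises point on S^1 with mean angle mu and concentration kappa."""
    theta = rng.vonmisesvariate(mu, kappa)
    return [math.cos(theta), math.sin(theta)]


rng = random.Random(11)
m1_sample = [sample_m1(0.4, 2, rng) for _ in range(25)]
m2_sample = [sample_m2(0.4, 3, rng) for _ in range(25)]
vm_sample = [sample_von_mises_circle(0.8, rng) for _ in range(25)]
print(m1_sample[0], m2_sample[0], vm_sample[0])
```

Setting $b = 0$ in either `sample_m1` or `sample_m2` gives uniform points on the sphere, which corresponds to the null hypothesis.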

Number of runs, significance level, sample sizes The outputs in Tables 1 and 2 below (which correspond to $d = 2$ and $d = 3$, respectively) show the proportion of rejections (power) over 10000 replications for distributions belonging to the models M1, M2 (values of $b = 0.2, 0.4, 0.8$) and von Mises (values of $\kappa = 0.2, 0.4, 0.8$). The intended significance level is in all cases 0.05. On the other hand, in both cases $d = 2$ and $d = 3$, the columns with heading $H_0$ analyze the behavior under the null and should be used to check the ability of the different tests to keep the intended type I error (0.05) under control.

The considered sample sizes are $n = 25, 50, 100$.

Table 1 Power of some tests on $S^1$ under the null hypothesis and several alternatives

Test      H0      Model M1                Model M2                von Mises model
                  b = .2  b = .4  b = .8  b = .2  b = .4  b = .8  κ = .2  κ = .4  κ = .8

n = 25
RP1       .0516   .1194   .3450   .7131   .0682   .1625   .6737   .0738   .1494   .4000
RP2       .0535   .1415   .4365   .8755   .0776   .1905   .8135   .0781   .1712   .5056
RP3       .0531   .1486   .4750   .9383   .0791   .1945   .8722   .0822   .1831   .5495
RP5       .0517   .1559   .5037   .9731   .0777   .2014   .9216   .0813   .1862   .5854
RP10      .0496   .1578   .5195   .9872   .0769   .2091   .9538   .0827   .1890   .6061
RP25      .0476   .1568   .5189   .9880   .0753   .2106   .9609   .0797   .1814   .6088
Kuiper    .0489   .1258   .3907   .9540   .0804   .2252   .9700   .0753   .1527   .4809
Rayleigh  .0497   .1851   .5945   .9937   .0497   .0576   .0747   .0877   .2075   .6703

n = 50
RP1       .0499   .2038   .5415   .8459   .0947   .3132   .8824   .0960   .2467   .5932
RP2       .0489   .2415   .6913   .9562   .1021   .3623   .9656   .1068   .2920   .7508
RP3       .0518   .2595   .7555   .9877   .1051   .3930   .9895   .1122   .3165   .8214
RP5       .0502   .2796   .8131   .9987   .1116   .4286   .9979   .1167   .3394   .8808
RP10      .0494   .2853   .8379   1       .1187   .4606   .9999   .1183   .3564   .9074
RP25      .0505   .2920   .8500   1       .1258   .4841   1       .1214   .3597   .9120
Kuiper    .0477   .2175   .7181   .9999   .1238   .4960   1       .0964   .2735   .8238
Rayleigh  .0482   .3280   .8913   1       .0524   .0542   .0697   .1280   .4035   .9431

n = 100
RP1       .0552   .3565   .7051   .9349   .1705   .5685   .9931   .1544   .4200   .7511
RP2       .0561   .4502   .8847   .9887   .1884   .6979   .9991   .1834   .5305   .9063
RP3       .0549   .4952   .9449   .9984   .1961   .7616   .9999   .1938   .575    .9624
RP5       .0527   .5226   .9793   1       .2057   .8193   1       .2007   .6124   .9921
RP10      .0506   .5398   .9918   1       .2143   .8538   1       .2014   .6327   .9974
RP25      .0505   .5509   .9922   1       .2280   .8741   1       .2057   .6416   .9981
Kuiper    .0471   .4180   .9627   1       .2268   .8856   1       .1590   .5178   .9899
Rayleigh  .0487   .5909   .9957   1       .0526   .0493   .0714   .2228   .7062   .9992

Software The simulation experiments have been implemented using the R language (R Development Core Team 2006). The corresponding codes are available from the authors. We have used the R packages MASS (Venables and Ripley 2002) and the above-mentioned Circular (Lund and Agostinelli 2006). In the generation of the von Mises distribution on the sphere $S^{d-1}$ with $d > 2$ (see also Sect. 4 below) we have used the algorithm proposed by Ulrich (1984) in the corrected version developed by Wood (1994).

The conclusions of these simulations could be summarized as follows:

(a) In the case of circular data, $d = 2$ (Table 1), the Rayleigh test is the winner in both the offset model M1 and the von Mises model. The RPk tests based on a large number of directions (RP10 and RP25) rank second, outperforming the Kuiper test (especially in the von Mises model).

(b) Still in the circular case (Table 1), under model M2, the Rayleigh test exhibits a total failure; in fact, it is almost “blind” against this alternative. The Kuiper test is the winner under this model, with a very slight advantage over RP25. Thus the overall performance of the RP methods is quite satisfactory.

(c) The situation is similar in the case of spherical data ($d = 3$, Table 2), except for the fact that the performance of Giné’s test (which is the overall winner) turns out to be better than that of Kuiper’s test considered in the case


Table 2 Power of some tests on $S^2$ under the null hypothesis and several alternatives

Test      H0      Model M1                 Model M2                 von Mises model
                  b = .2  b = .4  b = .8   b = .2  b = .4  b = .8   κ = .2  κ = .4  κ = .8
n = 25
RP1       .0519   .1383   .3498   .6545    .0795   .1969   .7050    .0593   .0888   .2102
RP5       .0498   .1760   .5757   .9708    .0917   .2703   .9390    .0642   .1097   .3115
RP10      .0513   .1912   .6383   .9958    .0926   .3005   .9775    .0623   .1147   .3377
RP25      .0535   .2046   .6876   .9986    .0957   .3311   .9934    .0607   .1193   .3691
RP50      .0513   .2010   .6858   .9991    .0911   .3258   .9951    .0637   .1233   .3785
RP100     .0504   .1994   .6810   .9994    .0933   .3343   .9956    .0595   .1170   .3749
Giné      .0506   .2163   .7251   .9997    .1263   .6085   1        .0650   .1293   .4253
Rayleigh  .0483   .2300   .7474   .9998    .0541   .0703   .0904    .0680   .1403   .4384
n = 50
RP1       .0480   .2139   .5145   .7925    .1203   .3628   .9122    .0645   .1222   .3386
RP5       .0460   .3249   .8665   .9972    .1502   .5715   .9985    .0752   .1824   .5835
RP10      .0510   .3665   .9248   1        .1632   .6716   1        .0775   .1943   .6317
RP25      .0561   .3986   .9515   1        .1791   .7532   1        .0810   .1970   .6764
RP50      .0557   .4072   .9592   1        .1886   .7734   1        .0833   .2029   .6974
RP100     .0516   .3959   .9556   1        .1648   .7694   1        .0827   .2043   .6981
Giné      .0495   .4290   .9692   1        .2767   .9718   1        .0883   .2274   .7598
Rayleigh  .0484   .4415   .9731   1        .0550   .0685   .0916    .0923   .2462   .7694
n = 100
RP1       .0506   .3597   .6673   .9034    .2089   .0902   .9972    .0902   .2095   .5195
RP5       .0518   .5946   .9813   1        .2992   .8928   1        .1072   .3229   .8615
RP10      .0470   .6520   .9982   1        .3262   .9551   1        .1097   .3538   .9274
RP25      .0465   .6724   .9996   1        .3501   .9883   1        .1200   .3808   .9533
RP50      .0468   .6884   .9998   1        .3686   .9954   1        .1181   .3851   .9575
RP100     .0532   .6971   .9999   1        .3770   .9969   1        .1123   .3793   .9579
Giné      .0445   .7578   .9998   1        .6488   1       1        .1346   .4483   .9759
Rayleigh  .0453   .7707   .9999   1        .0519   .0674   .0963    .1390   .4623   .9780

d = 2. Again the Rayleigh test fails under M2 but it is the winner under M1.

(d) The H0 columns show that the RPk tests succeed in keeping the significance level, with a similar performance to that of the considered competitors.

(e) All in all, it seems that the tests RP25 (for d = 2) and RP100 (for d = 3) could be competitive, with a reasonable overall performance. The above results suggest, at least, the interest of doing more research in this area. Obviously, the RP tests based on few directions are not to be recommended. They are included in the study only for the sake of a more complete understanding of the behavior of the RP method.

4 Projection tests for sphericity and homogeneity

In this section we show how to use the random projections methodology in two further testing problems. The first one is to test sphericity around a fixed point. The second one is the homogeneity problem, that is, to test whether two or more distributions coincide. In both cases the purpose is to present all the required ideas though, due to space constraints, we include just some practical illustrations rather than extensive simulations.

4.1 Testing sphericity around a fixed point

A random vector $Z = (Z_1, \ldots, Z_d)^t$ in $\mathbb{R}^d$ is said to be spherical, or to have a spherical distribution around the point $\mu_X$, if for any matrix $A$ in the group $O(d)$ of orthogonal $d \times d$ matrices (i.e. matrices such that $AA^t = A^tA = I_d$), the vector $A(Z - \mu_X)$ has the same distribution as $Z - \mu_X$.

Throughout this section, it will be assumed that the point $\mu_X$ is known, so that we may assume that $\mu_X = 0$. Thus in what follows “spherical” will mean “spherical around 0”.

It is well known that if Z is spherical and has an absolutely continuous distribution with density f, then f must necessarily be of the form $f(x) = h(\|x\|)$, for some univariate function h. Also, if Z has finite second-order moments, the sphericity assumption entails that the covariance matrix $\Sigma$ of Z fulfills

$$\Sigma = \sigma^2 I_d, \tag{1}$$

where $\sigma^2$ is the common variance of the $Z_i$. Note that, if Z is normally distributed, sphericity is in fact equivalent to (1).

See Fang et al. (1990) for a detailed account of spherical distributions.

There are several sphericity tests widely used in applied statistics. A very popular, classical choice is Bartlett’s test. It is based on the use of an (asymptotic) chi-square goodness-of-fit methodology to test the null hypothesis that the correlation matrix is $I_d$. See, e.g., Lattin et al. (2003, p. 109) for an elementary account of this test and its usefulness in principal component analysis. Another related well-known procedure, where the null hypothesis is also (1), is Mauchly’s test; see Timm (2002, p. 140).

These tests require the existence of second-order moments (which is a condition external to the sphericity assumption). Moreover, they cannot be applied in high-dimensional problems where the dimension d is typically larger than the sample size n, since they are usually based on the estimation of the correlation matrix R (in particular, Bartlett’s test is based on log(det(R))), and the corresponding estimators will fail when n < d (or sometimes when n is close to d).

The closeness between sphericity and uniformity provides a simple procedure to test sphericity, without any moment assumptions, in those high-dimensional cases. More specifically, it is easy to prove (see, e.g., Theorem 2.3 in Fang et al. 1990) that if $\|Z\|$ is almost surely positive and Z is spherical, then $Z/\|Z\|$ is uniformly distributed on $S^{d-1}$. So we could think of replacing the null hypothesis $H_0$: Z has a spherical distribution, with $\bar{H}_0$: $Z/\|Z\|$ is uniformly distributed on $S^{d-1}$. While these hypotheses are not equivalent (as the second one is less restrictive), in most practical cases the violations of sphericity will arise from the non-fulfillment of $\bar{H}_0$. Note also that the classical tests of sphericity (Bartlett, Mauchly, ...) are non-universal as well, since they test whether the covariance matrix is diagonal, which is not equivalent to sphericity either. Thus, any uniformity test suitable for high-dimensional data (in particular, our projection-based tests) will work in practice to test sphericity.
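The fact that $Z/\|Z\|$ is uniform on $S^{d-1}$ for spherical Z also gives a standard way of simulating uniform directions: normalize a standard Gaussian vector. The following Python sketch illustrates this and the mapping of a sample to the sphere; the function names are hypothetical and are not taken from the authors' R code.

```python
import math
import random


def uniform_direction(d, rng):
    """Draw one point on S^{d-1} by normalizing a standard Gaussian vector."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in z))
        if norm > 0.0:  # ||Z|| > 0 almost surely; guard against a degenerate draw
            return [c / norm for c in z]


def to_sphere(sample):
    """Replace each nonzero observation Z by Z/||Z||."""
    out = []
    for z in sample:
        norm = math.sqrt(sum(c * c for c in z))
        if norm == 0.0:
            raise ValueError("an observation at the origin has no direction")
        out.append([c / norm for c in z])
    return out
```

Any uniformity test on the sphere can then be applied to the output of `to_sphere` in order to test the relaxed null hypothesis.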

However, we next show how the random projections methodology can be used to get a specific consistent sphericity test. The whole procedure relies again on the above-mentioned Theorem 4.1 of CFR07. In view of the assumptions in that theorem, our test will require that the underlying distribution has finite moments of any order. In return, it is consistent among those distributions fulfilling these conditions and can be applied for any dimension and sample size.

The main result in this section (Theorem 4.2) is based on Proposition 4.1, which could have some independent interest.

Proposition 4.1 (1) If $\lambda$ is the Haar measure on $O(d)$, $A$ is a random matrix with distribution $\lambda$, and $U$ is a random vector uniformly distributed on $S^{d-1}$, then

(a) The distribution of $AU$ is uniform on $S^{d-1}$.

(b) The random vectors $U$ and $AU$ are independent.

(2) Conversely, if $U_1$ and $U_2$ are two independent random vectors uniformly distributed on $S^{d-1}$, then there exists a unique random matrix $A$ on $O(d)$ such that $U_2 = AU_1$, and the distribution of $A$ does not concentrate mass on any proper subgroup of $O(d)$.

The following Theorem 4.2 is based on Theorem 4.1 in CFR07, which requires that the distribution $P_X$ of the random vector X is determined by its moments. That means that there is no other distribution Q which satisfies

$$\int_{\mathbb{R}^d} P(x)\,dP_X(x) = \int_{\mathbb{R}^d} P(x)\,dQ(x)$$

for every polynomial P on $\mathbb{R}^d$.

Several sufficient conditions have been proposed to ensure this. One of the most popular is the so-called Carleman’s condition, defined by $\sum_n m_n^{-1/n} = \infty$, where $m_n$ is the absolute moment $m_n = E\|X\|^n$. This condition is satisfied for compactly supported or Gaussian distributions and, more generally, for distributions with light tails; see, for example, Shohat and Tamarkin (1943).
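As a quick check of Carleman’s condition in two simple cases (a sketch only; the bound K and the use of Stirling’s formula are ingredients added here for illustration), suppose first that $\|X\| \le K$ almost surely. Then $m_n \le K^n$, so that

$$\sum_n m_n^{-1/n} \ge \sum_n \frac{1}{K} = \infty.$$

For a standard normal variable in dimension d = 1, the absolute moments are $m_n = E|X|^n = 2^{n/2}\,\Gamma\!\left(\tfrac{n+1}{2}\right)/\sqrt{\pi}$, and Stirling’s formula gives

$$m_n^{1/n} \sim \sqrt{n/e}, \qquad \text{hence} \qquad m_n^{-1/n} \sim \sqrt{e/n},$$

so the series again diverges, since $\sum_n n^{-1/2} = \infty$.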

Theorem 4.2 Let X be a random vector with values in $\mathbb{R}^d$ whose distribution is determined by its moments. Denote by $\lambda$ the Haar measure on $S^{d-1}$. Then X has a spherical distribution if and only if

$$\lambda \times \lambda\left[\{(h_1, h_2) : h_1^t X \sim h_2^t X\}\right] > 0. \tag{2}$$

Remark 4.2.1 If we replace (2) by

$$\lambda \times \lambda\left[\{(h_1, h_2) : h_1^t X \not\sim h_2^t X\}\right] = 0,$$

Theorem 4.2 states the quite intuitive fact that if the distribution of X is spherical and $h_1, h_2 \in S^{d-1}$, then $h_1^t X$ and $h_2^t X$ are identically distributed. But statement (2) takes advantage of Theorem 4.1 in CFR07 in order to provide a result which is easily translated into a sphericity test, as shown next.

Theorem 4.2 suggests the following procedure to test sphericity: Given a random sample $X_1, \ldots, X_N$ from a random vector X, take a positive integer n such that n < N, and choose $h_1, h_2 \in S^{d-1}$ independent with uniform distribution.

Now, consider the independent samples of real random variables $\{h_1^t X_1, \ldots, h_1^t X_n\}$ and $\{h_2^t X_{n+1}, \ldots, h_2^t X_N\}$. By Theorem 4.2, testing the sphericity of X is equivalent to testing that the distribution that produced both samples is the same. Thus, we can test the sphericity of X just by testing (for instance, with the KS test) the hypothesis of homogeneity of both samples. This procedure will be denoted RP1-S. Note that it does not require us to know in advance the distributions $P_X$ or $P_{h_1^t X}$. Likewise, this procedure can be adapted to employ k pairs of projection directions $h_1, h_2$ (instead of just one) using the standard Bonferroni device, i.e., rejecting $H_0$ at level α whenever it is rejected by at least one of the k tests performed at level α/k. In this case we cannot use the minimum p-value as a test statistic itself (as indicated in Sect. 2.2) since the null hypothesis is composite (that is, it is fulfilled by more than one distribution). Indeed, while all the individual p-values are uniformly distributed under the null hypothesis, their joint law will depend in general on the underlying spherical distribution, and so the minimum p-value will not in general be distribution free. This procedure will be denoted RPk-S.
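The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' R implementation: the function names are hypothetical, the same split of the sample is reused for all k pairs of directions, and the two-sample KS p-value is computed from the asymptotic Kolmogorov series rather than from an exact distribution.

```python
import math
import random


def uniform_direction(d, rng):
    """Uniform point on S^{d-1}, obtained by normalizing a Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z]


def ks_two_sample(a, b):
    """Two-sample KS statistic with an asymptotic (approximate) p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d_max = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:   # advance past ties in the first sample
            i += 1
        while j < m and b[j] <= x:   # advance past ties in the second sample
            j += 1
        d_max = max(d_max, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d_max
    if lam < 0.2:                    # the series is numerically 1 in this range
        return d_max, 1.0
    p = 2.0 * sum((-1) ** (r - 1) * math.exp(-2.0 * r * r * lam * lam)
                  for r in range(1, 101))
    return d_max, min(1.0, max(0.0, p))


def rpk_s_test(sample, n, k, alpha=0.05, rng=None):
    """RPk-S sketch: k pairs of random directions combined by Bonferroni.

    Returns (reject, p_values); H0 is rejected when some individual
    KS p-value is at most alpha / k.
    """
    rng = rng or random.Random()
    d = len(sample[0])
    first, second = sample[:n], sample[n:]
    p_values = []
    for _ in range(k):
        h1 = uniform_direction(d, rng)
        h2 = uniform_direction(d, rng)
        proj1 = [sum(h * x for h, x in zip(h1, row)) for row in first]
        proj2 = [sum(h * x for h, x in zip(h2, row)) for row in second]
        p_values.append(ks_two_sample(proj1, proj2)[1])
    return min(p_values) <= alpha / k, p_values
```

With k = 1 the function reduces to RP1-S. The individual p-values are returned as well, so that a Bonferroni-corrected p-value can be reported if desired.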

In our view there are mainly two situations where these tests are especially suitable, when compared with the classical choices (e.g., Bartlett’s test) based on covariance matrices: first, the high-dimensional cases with $d \gg n$; second, those cases where the underlying distribution is not spherical but its projection on the unit sphere is still uniformly distributed. As we will show in the next subsection, such examples arise in a quite simple way, even in $\mathbb{R}^2$.

4.2 Some Monte Carlo comparisons for sphericity tests

We have compared our projection RP1-S and RP10-S tests with Bartlett’s test in high-dimensional situations (d = 50, 100) using two sample sizes: n = 50, 100. More specifically, we have evaluated, in 2000 runs, the empirical level (i.e. the rejection proportion under the null hypothesis of uniformity) of Bartlett’s test and that of the RPk-S tests. We have also computed the respective empirical powers under the offset model M2 with a parameter value b = 0.5. The results are given in Table 3. The entry NA means that there were errors when computing log(det(R)) in Bartlett’s test.

As could be foreseen, Bartlett’s test simply does not work in these situations since it is unable to keep the level and shows an erratic behaviour, associated with the numerical instabilities in the estimation of a large correlation matrix from too few sample observations. RP1-S tends to be rather conservative. This is likely due to the use of the KS test in the final stage. The power under the alternative hypothesis is moderate for k = 1 but increases appreciably for k = 10, even under the present low rates of n/d.

Another situation in which the RPk-S tests could be particularly useful is in those cases where the lack of sphericity cannot be detected via a uniformity test since the projections of the data on the unit sphere are uniformly distributed. Such distributions are very easy to construct. For example, we could consider a model under which the data come from a standard normal distribution, but those belonging to a given cone C are multiplied by a positive constant b. We have done 2000 simulations using this model (which we call M3) in the case d = 2, with $C = \{(x_1, x_2) : x_1 < 0\}$. The results are shown in Table 4.
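To make the construction of M3 concrete, the following Python sketch generates data from this model in the plane and extracts the polar angles. The function names are hypothetical; the point of the example is that multiplying by b > 0 changes the norms inside the cone but leaves every direction unchanged, so the projected data on the unit circle remain uniform.

```python
import math
import random


def sample_m3(n, b, rng):
    """Draw n points from model M3 in R^2: standard normal data, with the
    points falling in the cone C = {x1 < 0} multiplied by b > 0."""
    points = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if x1 < 0.0:
            x1, x2 = b * x1, b * x2
        points.append((x1, x2))
    return points


def angles(points):
    """Polar angle of each point, i.e. its projection on the unit circle."""
    return [math.atan2(x2, x1) for x1, x2 in points]
```

Running `sample_m3` twice with the same seed, once with b = 1 (the null) and once with b = 6, yields the same angles but different norms on the half-plane $x_1 < 0$.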

4.3 Testing homogeneity—The case of compositional data

Our last proposal has to do with the application of the random projections methodology to compositional data (see Sect. 1.3 above).

As previously mentioned, the main point is that compositional data can also be seen as “circular data” by just replacing the $\|\cdot\|_2$-sphere with the $\|\cdot\|_1$-sphere $S_1^{d-1}$ (or rather its positive face). The basic result (Theorem 2.2) which allows us to characterize distributions in the Euclidean sphere remains true for $S_1^{d-1}$.

As a consequence, one could also define uniformity tests for compositional data along similar lines to those indicated in Sect. 2. A technical difference is the fact that, unlike the circular case, the null distributions are not expected to have relatively simple closed forms in the compositional case. Therefore they should be approximated by Monte Carlo procedures. Otherwise, the method would be analogous.

However, we will not focus here on the uniformity test. We will rather briefly use the case of compositional data to illustrate another use of the random projection methodology, namely the development of homogeneity tests. In view of Theorem 2.2 the idea is quite simple: In a two-sample problem with compositional data, we will decide that both samples come from the same distribution if their respective one-dimensional projections along a random direction are equally distributed. This can be decided by using the standard distribution-free Kolmogorov-Smirnov homogeneity test (KS2). Of course, the idea can also be adapted to be used with several independent random directions, instead of just one. In this case we use (as in the sphericity test) the Bonferroni procedure to combine the tests performed


Table 3 Power of the RP1-S, RP10-S and Bartlett’s tests under the null hypothesis and model M2 with b = 0.5 in $\mathbb{R}^d$, d = 50, 100, and sample sizes n = 50, 100

Dimension  Test      H0                  Model M2, b = 0.5
                     n = 50   n = 100    n = 50   n = 100
d = 50     RP1-S     .028     .034       .384     .598
           RP10-S    .016     .024       .466     .959
           Bartlett  NA       .143       NA       1
d = 100    RP1-S     .034     .041       .366     .578
           RP10-S    .021     .020       .558     .973
           Bartlett  1        NA         1        NA

Table 4 Power of the RP1-S, RP10-S and Bartlett’s tests under the null hypothesis and model M3 with b = 1, 3, 6, 8 in $\mathbb{R}^2$, and sample sizes n = 100, 200

Test      Model M3, n = 100                 Model M3, n = 200
          H0     b = 3   b = 6   b = 8      H0     b = 3   b = 6   b = 8
RP1-S     .041   .176    .294    .331       .042   .311    .447    .498
RP10-S    .025   .216    .600    .702       .027   .706    .955    .969
Bartlett  .053   .083    .098    .110       .044   .070    .098    .110

Table 5 p-values (obtained in one run) and proportion of rejections (along 500 runs) of the RP-homogeneity tests for the data set AnimalVegetation, with different numbers of random directions

k           1      2      3      5      10
p-values    .52    .52    .60    .57    .68
Rejections  .064   .054   .008   0      0

along different directions, instead of using the test based on the minimum p-value. Again the reason is that the null hypothesis of homogeneity is composite and, in general, the null distribution of the minimum p-value will depend on the common distribution of both samples.

As a practical illustration, let us consider a data set analyzed in Aitchison (1986, p. 22). It arises as a consequence of an ecological experiment during which “plots of land of equal area were inspected and the parts of each plot which were thick or thin in vegetation and dense or sparse in animals were identified. From this field work the areal proportions of each plot were calculated for the four mutually exclusive and exhaustive categories: thick-dense, thick-sparse, thin-dense, thin-sparse. These sets of proportions are recorded for 50 plots from each of two different regions A and B”. We quote from the help file corresponding to the data set AnimalVegetation in the R package compositions, by van den Boogaart et al. (2006).

The second row of Table 5 shows the p-values (after Bonferroni’s correction) obtained in one execution of the RP-homogeneity test with several numbers of independent random directions. The third row contains the proportions of rejections of the null (homogeneity) hypothesis over 500 executions of the test (at level 0.05), with different random choices of the projection directions.
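The Bonferroni correction referred to above can be written down in a couple of lines. The following Python sketch uses hypothetical function names and assumes that the k individual p-values (one KS2 test per random direction) have already been computed; the corrected p-value is k times the smallest individual p-value, capped at 1.

```python
def bonferroni_p_value(p_values):
    """Bonferroni-corrected p-value for k tests: min(1, k * min p)."""
    k = len(p_values)
    return min(1.0, k * min(p_values))


def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 at level alpha if some individual test rejects at level alpha/k."""
    return min(p_values) <= alpha / len(p_values)
```

Rejecting when the corrected p-value is at most α is the same rule as rejecting when at least one individual test rejects at level α/k.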

As a consequence of this analysis we must say that no evidence against homogeneity has been found. The choice of k does not essentially affect the conclusion.

5 A real data example: are the comet orbits uniformly distributed?

5.1 Some background

It is a well-known fact that the planes defined by the orbits of the planets in the Solar System are all nearly coincident with the ecliptic (the plane of the Earth’s orbit). The only partial exception is Pluto, which in fact is no longer considered a true planet, according to the recent decision of the International Astronomical Union.

Such an “almost coincidence” in the orbits has intrigued scientists for a long time. Thus D. Bernoulli (in the 1730s) wondered if this fact could happen “by chance”. To put this question in a statistical framework, we could consider that the planet orbits are defined by the corresponding normal vectors. So they reduce to a sample of nine points in the unit sphere $S^2$. Therefore, one could think of using a uniformity test on the sphere. This has been done, e.g., by Mardia and Jupp (2000, p. 209), obtaining strong evidence against the uniformity hypothesis. Although, as these authors rightly point out, it is not clear what is the random variable from which the planet orbits are an i.i.d. sample, this result has at least the interest of suggesting an analogous, but more ambitious, question regarding the possible uniformity of comet orbits.

It is believed that most of the comets are produced in the Kuiper Belt or in the Oort Cloud following a nowadays unpredictable mechanism. Thus, their orbits could be seen as multiple realizations (in fact we have thousands of them) of an experiment whose output is not deterministic.

Comet orbits are always conic sections. So they are either elliptical (which leads to periodic comets), parabolic or hyperbolic (non-periodic comets). The orbit plane is usually determined by two angles, namely the inclination (denoted by i), which is the angle between the normal vector of the orbit and the normal vector of the ecliptic, and the so-called longitude of the ascending node, often denoted by Ω; see, e.g., Teets and Whitehead (1998) for a definition of Ω. The expression of the normal vector to the orbit plane in terms of i and Ω is

$$R = (\sin(i)\sin(\Omega),\ -\sin(i)\cos(\Omega),\ \cos(i))^t. \tag{3}$$

Thus, the orbit of a comet can be seen as an observation from the $S^2$-valued random variable R. We are interested in testing the null hypothesis that R is uniformly distributed on $S^2$ from a random sample $R_1, \ldots, R_n$ of this variable.
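Expression (3) is straightforward to evaluate. The following Python sketch (the function name is hypothetical, and it is assumed here that both angles are supplied in degrees) converts a pair (i, Ω) into the corresponding unit normal vector.

```python
import math


def orbit_normal(inclination_deg, node_deg):
    """Unit normal vector of the orbit plane, following expression (3).

    Both angles are assumed to be given in degrees.
    """
    i = math.radians(inclination_deg)
    omega = math.radians(node_deg)
    return (
        math.sin(i) * math.sin(omega),
        -math.sin(i) * math.cos(omega),
        math.cos(i),
    )
```

An orbit lying in the ecliptic (i = 0) is mapped to the north pole (0, 0, 1) whatever the value of Ω, and the output always has unit Euclidean norm.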

The paper by Jupp et al. (2003, see also references therein) tackles this problem with a particular emphasis on the statistical aspects. The analysis focuses on a data set of 658 single-apparition long-period cometary orbits from the catalog of Marsden and Williams (1993). Such data provide strong statistical evidence against the uniformity assumption for the distribution of R in this type of cometary orbits. Such a conclusion is somewhat surprising as it contradicts the isotropy properties which could be expected a priori. According to the explanation (supported by a careful data analysis) proposed by Jupp et al. (2003), the lack of uniformity is largely due to an observational bias which favours the observation of cometary orbits near the ecliptic. As a solution, these authors provide a new probabilistic model on the unit sphere which incorporates an observational window defined by the inequality $|\sin(i)| \le \varepsilon$. This model assumes that the probability density of R inside the observational window is uniform and its value outside (i.e. for $|\sin(i)| > \varepsilon$) is $(2/\pi\varepsilon)\arcsin(\varepsilon/\sin(i))$. The parameter ε can be estimated by maximum likelihood: the estimate found in Jupp et al. (2003) is $\hat{\varepsilon} = .84$. The proposed model shows a satisfactory goodness of fit to the data set at hand.

5.2 The data

We take here a further look at this problem, using now a different database, freely available from the NASA website, http://ssd.jpl.nasa.gov/sbdb_query.cgi#x. The following steps describe exactly how to obtain the data we have analyzed. The terminology is that used in the NASA web page:

STEP 1 Object group: Select All. Object kind: Tick Comets. Numbered state: Choose Unnumbered (the comets receive a permanent number prefix only after the second perihelion passage). Limit to selected orbit classes: In the box named Comet Orbit Classes select all, except Parabolic and Hyperbolic. Limit by other characteristics: In the box Select orbital parameter choose period(years). In the next box, operator, select >=. In the box on the right write 200. Then tick the box Add.

STEP 2 In Object Fields select object internal database ID, and object full name/designation. In Object and Model Parameter Fields select inclination and longitude of the ascending node.

Table 6 p-values of different uniformity tests for the NASA comet data set

RP50   RP100   Giné   Rayleigh
.260   .250    .060   .210

The above procedure led to a data file which, by December 14, 2007, included the values of (i, Ω) for 211 comets. An inspection of these data showed that four of them (in positions 12 to 15) were identical up to the fourth decimal place. They were all labelled “Great September Comet”. Our statistical analysis has been applied to the 208 data obtained by suppressing three of these repeated observations.

An exploratory data analysis, similar to that in Jupp et al. (2003), has been performed in order to visually detect any clear deviations from uniformity.

Similarly to the case of the data studied in Jupp et al. (2003), the distribution of i shows some deviations from uniformity near the extremes of its range. However, as shown in the following subsection, such deviations do not provide enough statistical evidence against uniformity.

5.3 Results and conclusions

Giné’s and Rayleigh’s tests, as well as the random projection tests RP50 and RP100 (based on k = 50 and 100 random directions, respectively), have been performed with these data. The corresponding p-values are given in Table 6.

As a conclusion, we can say that no statistical evidence has been found against the uniformity hypothesis.

The p-value of Giné’s test is the one that comes closest to providing statistical evidence against uniformity (though still without enough support). This could perhaps be interpreted in terms of a higher sensitivity of this test to the mild observational bias shown in the left histogram of Fig. 1.

6 Technical appendix

Fig. 1 Histograms of the variables “inclination” (i) and “longitude of the ascending node” (Ω)

Proof of Theorem 2.2 (1) Obviously, we can consider that X is a bounded variable taking values in $\mathbb{R}^d$ and, without loss of generality, we can assume that $U = Z/\|Z\|$ where Z is Gaussian.

Thus, the condition on the moments of X in Theorem 4.1 in CFR07 holds. However, taking into account that, up to a constant, the conditional distribution of $X^tU$ coincides with that of $X^tZ$, the result follows easily from Theorem 4.1 in CFR07.

(2) As in case (1), the proof is a direct consequence of the extension of Theorem 4.1 in CFR07 to the Banach space setup given in Cuevas and Fraiman (2008, Theorem 5), which covers the case of the non-Hilbert norm $\|\cdot\|_1$; that is, this result proves that the distribution of X is determined by that of the dual element $X^tU$, where U is uniform on $S_1^{d-1}$. □

Proof of the convergence in Remark 2.2.4 Let us consider a fixed sample $X_1, X_2, \ldots, X_n$. Obviously, the sequence $\{M_k\}_k$ is increasing and bounded. Thus, it converges. Let us show that its a.s. limit is M.

Given $u \in S^{d-1}$, let us denote $X_i^u = X_i^t u$, $i = 1, \ldots, n$, and let $X_{(1)}^u, \ldots, X_{(n)}^u$ be the corresponding order statistics, $X_{(1)}^u \le \cdots \le X_{(n)}^u$. Taking into account that $F_0$ is continuous, it is not difficult to prove that

$$\sup_{t \in [-1,1]} |F_n(t \mid u) - F_0(t)| = \sup_{i=1,\ldots,n} \left[ \max\left( \left| \frac{i}{n} - F_0\left(X_{(i)}^u\right) \right|, \left| \frac{i-1}{n} - F_0\left(X_{(i)}^u\right) \right| \right) \right]. \tag{4}$$
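Identity (4) is also the usual way of computing the Kolmogorov-Smirnov distance in practice. The following Python sketch evaluates its right-hand side for a given sample of projections and a continuous null distribution function; the names are hypothetical. As an example of null distribution we use $F_0(t) = (t+1)/2$, which is the distribution function of the projection of the uniform distribution on $S^2$ (d = 3) along any fixed direction.

```python
def ks_distance(projections, cdf):
    """Right-hand side of (4): sup-distance between the empirical
    distribution function of the projections and a continuous cdf."""
    xs = sorted(projections)
    n = len(xs)
    return max(
        max(abs(i / n - cdf(x)), abs((i - 1) / n - cdf(x)))
        for i, x in enumerate(xs, start=1)
    )


def f0_sphere_d3(t):
    """Null cdf of a one-dimensional projection of the uniform law on S^2."""
    return (t + 1.0) / 2.0
```

For instance, `ks_distance([-0.5, 0.5], f0_sphere_d3)` evaluates the distance for a sample of two projections.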

Let us consider the map $H : \mathbb{R}^d \to \mathbb{R}^n$ defined by

$$H(u) = \left(X_{(1)}^u, \ldots, X_{(n)}^u\right), \quad u \in \mathbb{R}^d$$

(remember that the sample $X_1, X_2, \ldots, X_n$ is fixed). This map is continuous and, since $F_0$ is also continuous, we have that for every $i = 1, \ldots, n$, the map

$$u \mapsto \max\left( \left| \frac{i}{n} - F_0\left(X_{(i)}^u\right) \right|, \left| \frac{i-1}{n} - F_0\left(X_{(i)}^u\right) \right| \right)$$

is continuous. Thus, the expression in (4) is also continuous in u. Then, the proof concludes taking into account that the sequence $\{U_k\}_k$ is a.s. dense in $S^{d-1}$ because the support of the distribution employed to generate the random directions is $S^{d-1}$. □

The following Proposition 6.1 is an auxiliary result required to prove Theorem 4.2. It implies that if X is not spherical and we choose an orthogonal matrix A at random, then $AX \not\sim X$ for almost all A. Moreover, it is possible to take as λ in this proposition the Haar (uniform) measure on $O(d)$.

Proposition 6.1 Let λ be a probability measure on the orthogonal group $O(d)$ that does not concentrate mass on any proper subgroup. Let X be a random vector on $\mathbb{R}^d$. Then, X is spherical if and only if $\lambda(\{A \in O(d) \text{ such that } AX \sim X\}) > 0$.

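Proof The “only if” part is obvious. Concerning the “if” part, let $\mathcal{G} = \{A \in O(d) : AX \sim X\}$. Observe that

if $A, B \in \mathcal{G}$, then $AB \in \mathcal{G}$, since $BX \sim X$ implies $ABX \sim AX \sim X$,

if $A \in \mathcal{G}$, then $A^{-1} \in \mathcal{G}$, since $AX \sim X$ implies $X = A^{-1}AX \sim A^{-1}X$,

which implies that $\mathcal{G}$ is a subgroup of $O(d)$. Therefore, we have that $\mathcal{G} = O(d)$ because $\mathcal{G}$ is a group with $\lambda(\mathcal{G}) > 0$ and λ does not concentrate mass on any proper subgroup of $O(d)$. □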

Proof of Proposition 4.1 Let $B_1, B_2$ be Borel sets in $S^{d-1}$. Obviously, the surface measures of the sets $B_1$ and $A^tB_1$ coincide. From here, and the independence between U and A, we obtain that

$$P[AU \in B_1] = \int P\left[U \in A^tB_1 \mid A\right] dP = \int P[U \in B_1]\, dP = P[U \in B_1],$$

which proves (a).

Now, from the independence between A and U and the fact that, if $h_1, h_2 \in S^{d-1}$, then $Ah_1 \sim Ah_2$, we have that

$$P[U \in B_1, AU \in B_2] = \int_{B_1} P[Ah \in B_2 \mid U = h]\, dP_U(h) = \int_{B_1} P[Ah \in B_2]\, dP_U(h) = P[Ae_1 \in B_2]\, P[U \in B_1], \tag{5}$$

where $e_1$ is any vector in $S^{d-1}$. A similar reasoning would give us that $P[Ae_1 \in B_2] = P[AU \in B_2]$, and (b) follows from (5).

Concerning (2), we have that for every $h_1, h_2 \in S^{d-1}$, there exists a unique $A_{h_1,h_2} \in O(d)$ such that $h_2 = A_{h_1,h_2} h_1$. Since $A_{h_1,h_2}$ is a continuous function of $h_1$ and $h_2$, we have that the map $\omega \mapsto A(\omega) := A_{U_1(\omega),U_2(\omega)}$ is measurable and, by construction, $U_2 = AU_1$.

If $P_A$ were concentrated on a proper subgroup of $O(d)$, then the distribution of $U_2$ given that $U_1 = u$ would be concentrated on the transformation of u by this subgroup and it would not be uniform on $S^{d-1}$, which contradicts the assumptions. □

Proof of Theorem 4.2 Let A be a random matrix, independent of X, with the Haar distribution on $O(d)$. According to Proposition 6.1, the sphericity of X is equivalent to $AX \sim X$.

Let U be a random vector uniformly distributed on $S^{d-1}$ and independent of A and X. Theorem 4.1 in CFR07 implies that $AX \sim X$ if and only if

$$P_U\left[h : h^tAX \sim h^tX\right] > 0. \tag{6}$$

But this is equivalent to saying that

$$0 < P_U\left[h : P[(U^tA)X \mid U = h] = P[U^tX \mid U = h]\right]. \tag{7}$$

Obviously, $P_{h^tX} = P[U^tX \mid U = h]$, and from 1.(b) in Proposition 4.1 we obtain the independence between $U^tA$ and U. Thus, (7) is equivalent to

$$0 < P_U\left[h : U^tAX \sim h^tX\right].$$

Finally, by 1.(a) in Proposition 4.1, we know that the distribution of $U^tA$ is uniform, and then we have that (6) is equivalent to (2). □

Acknowledgements We are very grateful to César Sánchez-Sellero for a useful discussion on multiple contrasts and, in particular, for bringing to our attention the paper by Berk and Jones (1978). We also thank three anonymous referees and an Associate Editor whose comments have considerably contributed to improve the manuscript.

This work has been partially supported by Spanish grants MTM2005-08519-C02-02 and PAPIJCL VA102/06 (J. Cuesta-Albertos), MTM2007-66632 (A. Cuevas and R. Fraiman) and the IV PRICIT program titled Modelización Matemática y Simulación Numérica en Ciencia y Tecnología (A. Cuevas).

References

Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman and Hall, London (1986)

Aitchison, J., Egozcue, J.J.: Compositional data analysis: Where are we and where should we be heading? Math. Geol. 37, 829–850 (2005)

Berk, R.H., Jones, D.H.: Relatively optimal combinations of test statistics. Scand. J. Stat. 5, 158–162 (1978)

Brown, T.C., Cartwright, D.I., Eagleson, G.K.: Correlations and characterizations of the uniform distribution. Aust. J. Stat. 28, 89–96 (1986)

Cuesta-Albertos, J.A., Fraiman, R., Ransford, T.: Random projections and goodness-of-fit tests in infinite-dimensional spaces. Bull. Braz. Math. Soc. 37, 1–25 (2006)

Cuesta-Albertos, J.A., del Barrio, T., Fraiman, R., Matrán, C.: The random projection method in goodness of fit for functional data. Comput. Stat. Data Anal. 51, 4814–4831 (2007a)

Cuesta-Albertos, J.A., Fraiman, R., Ransford, T.: A sharp form of the Cramer–Wold theorem. J. Theor. Probab. 20, 201–209 (2007b)

Cuevas, A., Fraiman, R.: On depth measures and dual statistics: A methodology for dealing with general data. J. Multivariate Anal. (2008, to appear)

Fang, K.T., Kotz, S.K., Ng, K.W.: Symmetric Multivariate and Related Distributions. Chapman & Hall, London (1990)

Figueiredo, A.: Comparison of tests of uniformity defined on the hypersphere. Stat. Probab. Lett. 77, 329–334 (2007)

Fisher, N.I.: Statistical Analysis of Circular Data. Cambridge University Press, Cambridge (1993)

Giné, E.: Invariant tests for uniformity on compact Riemannian manifolds based on Sobolev norms. Ann. Stat. 3, 1243–1266 (1975)

Juan, J., Prieto, F.J.: Using angles to identify concentrated multivariate outliers. Technometrics 43, 311–322 (2001)

Jupp, P.E., Kim, P.T., Koo, J.Y., Wiegert, P.: The intrinsic distribution and selection bias of long-period cometary orbits. J. Am. Stat. Assoc. 98, 515–521 (2003)

Lattin, J., Carroll, J.D., Green, P.E.: Analyzing Multivariate Data. Brooks/Cole, Pacific Grove (2003)

Lund, U., Agostinelli, C.: Circular: Circular Statistics. R package version 0.3-6 (2006)

Mardia, K.V., Jupp, P.E.: Directional Statistics. Wiley, Chichester (2000)

Marsden, B.G., Williams, G.V.: Catalogue of Cometary Orbits, 8th edn. Minor Planet Center, Smithsonian Astrophysical Observatory, Cambridge (1993)

R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2006). ISBN 3-900051-07-0, URL http://www.R-project.org

Shohat, J.A., Tamarkin, J.D.: The Problem of Moments. Am. Math. Soc., Providence (1943)

Stephens, M.A.: Use of the von Mises distribution to analyse continuous proportions. Biometrika 69, 197–203 (1982)

Teets, D.A., Whitehead, K.: Computation of planetary orbits. College Math. J. 29, 397–404 (1998)

Timm, N.H.: Applied Multivariate Analysis. Springer, New York (2002)

Ulrich, G.: Computer generation of distributions on the m-sphere. Appl. Stat. 33, 158–163 (1984)

van den Boogaart, K.G., Tolosana, R., Bren, M.: Compositions: Compositional Data Analysis. R package version 0.91-6, URL: http://www.stat.boogaart.de/compositions (2006)

Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York (2002)

Wood, A.T.A.: Simulation of the von Mises-Fisher distribution. Commun. Stat. Simul. Comput. 23(1), 157–164 (1994)

