Liu R. (1990) on a Notion of Data Depth Based on Random Simpleces

Date post: 04-Jun-2018
Upload: judith-lugo
  • 8/13/2019 Liu R. (1990) on a Notion of Data Depth Based on Random Simpleces


    On a Notion of Data Depth Based on Random Simplices

    Author(s): Regina Y. Liu
    Source: The Annals of Statistics, Vol. 18, No. 1 (Mar., 1990), pp. 405-414
    Published by: Institute of Mathematical Statistics
    Stable URL: http://www.jstor.org/stable/2241550

    Accessed: 11/10/2012 18:48


    The Annals of Statistics
    1990, Vol. 18, No. 1, 405-414

    ON A NOTION OF DATA DEPTH BASED ON RANDOM SIMPLICES

    By Regina Y. Liu

    Rutgers University

    To the memory of my teacher and friend John Van Ryzin

    For a distribution $F$ on $\mathbb{R}^p$ and a point $x$ in $\mathbb{R}^p$, the simplicial depth $D(x)$ is introduced, which is the probability that the point $x$ is contained inside a random simplex whose vertices are $p + 1$ independent observations from $F$. Mathematically and heuristically it is argued that $D(x)$ indeed can be viewed as a measure of depth of the point $x$ with respect to $F$. An empirical version of $D(\cdot)$ gives rise to a natural ordering of the data points from the center outward. The ordering thus obtained leads to the introduction of multivariate generalizations of the univariate sample median and L-statistics. This generalized sample median and L-statistics are affine equivariant.

    1. Introduction. The main goal of this paper is to introduce a new notion of data depth. This notion emerges naturally out of a fundamental concept underlying affine geometry, namely that of a simplex, and it satisfies the requirements one would expect from a notion of data depth. Thus it leads to an affine invariant, center-outward ranking of the data points. We now turn to a detailed description.

    Let $X_1, \dots, X_n$ be a bivariate data set. Given any three data points $X_i$, $X_j$ and $X_k$, we can form the closed triangle with vertices $X_i$, $X_j$ and $X_k$, which we denote by $\Delta(X_i, X_j, X_k)$. From the $n$ data points, we generate in this way $\binom{n}{3}$ triangles. To any point $x$ in $\mathbb{R}^2$ we can associate then the number of those triangles which contain $x$ inside. This number should be larger if $x$ is deep inside or near the center of the data cloud, and smaller if $x$ is relatively near its outskirts. This suggests the following notion of depth measure, which we shall call simplicial depth since it is based on triangles and their $p$-dimensional generalizations, which are simplices. Denote by $x \in \Delta(X_i, X_j, X_k)$ the event that $x$ falls inside the closed random triangle $\Delta(X_i, X_j, X_k)$ and by $I(A)$ the indicator function of an event $A$, i.e., $I(A) = 1$ if $A$ occurs and $I(A) = 0$ otherwise. Then

    (1.1)  $D_n(x) = \binom{n}{3}^{-1} \sum_{1 \le i < j < k \le n} I\bigl(x \in \Delta(X_i, X_j, X_k)\bigr)$


    [...] corresponding to each triangle $\Delta(X_i, X_j, X_k)$, one by one until all $\binom{n}{3}$ triangles are exhausted. The resulting solid will represent the exact shape of $D_n(\cdot)$.

    It is clear that $D_n(x)$ defined in (1.1) is an empirical version of the probability

    (1.2)  $D(x) = P_F\bigl(x \in \Delta(X_1, X_2, X_3)\bigr)$

    if the $X_i$'s are i.i.d. with common distribution function $F$. The quantity $D(x)$ should assume higher values when $x$ is near the center of the distribution and should tend to 0 as $x$ moves away from the center. We shall refer to $D(x)$ in (1.2) as the simplicial depth (SD) of $x$ with respect to $F$ in $\mathbb{R}^2$ and to $D_n(x)$ in (1.1) as the sample simplicial depth of $x$ with respect to the data cloud $X_1, \dots, X_n$.

    It may be instructive to consider the univariate analog of SD, namely,

    (1.3)  $D(x) = P\bigl(x \in \overline{X_1 X_2}\bigr).$

    Here $x$ is in $\mathbb{R}^1$, $X_1$ and $X_2$ are two independent observations from a univariate c.d.f. $F$ and $\overline{X_1 X_2}$ represents the closed line segment connecting $X_1$ and $X_2$. When $F$ is continuous,

    (1.4)  $D(x) = 2F(x)[1 - F(x)].$

    It follows immediately that any point which maximizes $D(x)$ is a population median. The maximum value of $D(\cdot)$ is $1/2$ in this case, and $D(x)$ decreases monotonically to 0 as $x$ is pulled away from the median.
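As a rough illustration of (1.1) and of the univariate analog (1.3), the following Python sketch counts by brute force the data triangles (or segments) that contain a query point. The function names are hypothetical and do not come from the paper. The triangle test assumes that no three data points are collinear, which holds almost surely when $F$ has a density, and the univariate sample version is this sketch's assumption about the natural empirical counterpart of (1.3). Enumerating all $\binom{n}{3}$ triangles is practical only for small $n$.

```python
import math
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of the triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_closed_triangle(x, a, b, c):
    """True if x lies in the closed triangle with vertices a, b, c.

    Assumes a, b, c are not collinear.
    """
    signs = (_orient(a, b, x), _orient(b, c, x), _orient(c, a, x))
    # x is outside exactly when the three orientations have mixed signs.
    return not (min(signs) < 0 < max(signs))


def sample_depth_2d(x, data):
    """Equation (1.1): fraction of the C(n, 3) data triangles containing x."""
    hits = total = 0
    for a, b, c in combinations(data, 3):
        total += 1
        hits += in_closed_triangle(x, a, b, c)
    return hits / total


def sample_depth_1d(x, data):
    """Assumed sample analog of (1.3): fraction of closed segments containing x."""
    hits = total = 0
    for u, v in combinations(data, 2):
        total += 1
        hits += min(u, v) <= x <= max(u, v)
    return hits / total


# Illustrative data: the vertices of a regular 15-gon on the unit circle.
polygon = [
    (math.cos(2 * math.pi * k / 15), math.sin(2 * math.pi * k / 15))
    for k in range(15)
]
print(sample_depth_2d((0.0, 0.0), polygon))  # depth of the center
print(sample_depth_2d((5.0, 5.0), polygon))  # depth of a point far outside
```

For the regular 15-gon, 140 of the 455 triangles contain the center, while a point outside the convex hull of the data lies in no triangle and therefore has sample depth 0.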

    The above observation suggests that we call a point in $\mathbb{R}^2$ which maximizes $D(\cdot)$ a bivariate simplicial median. We denote such a point by $\mu$ and will also call it center when geometric understanding is emphasized. The sample version of the bivariate median is then

    (1.5)  $\hat\mu_n$ = the data point $X_{i_0}$ attaining highest sample SD.

    If the maximum is achieved at more than one data point, we can define $\hat\mu_n$ as the average of those data points which maximize $D_n(\cdot)$.

    The heuristic motivation for $\hat\mu_n$ as the sample median is the following: If $D(\cdot)$ is continuous and $\mu$ is the unique maximizer for $D(\cdot)$ in $\mathbb{R}^2$, an estimator for $\mu$ would be a point $x_0$ in the plane which maximizes $D_n(\cdot)$. If $F$ has a nonzero density in the neighborhood of $\mu$, we would expect the data point $X_{i_0}$ which maximizes $D_n(\cdot)$ among all the data points to be close to $x_0$ and, hence, to $\mu$. These arguments can actually be made rigorous, as we shall see in Section 3.

    A major task here is to show that $D(x)$ defined in (1.2) can indeed be viewed as a measure of depth; that is, to show formally that it possesses some kind of monotonicity property similar to the one that $D(\cdot)$ possesses in the univariate analog. This is established in Theorem 3 of Section 2. To be more precise, the theorem asserts that when the underlying distribution is angularly symmetric (see Section 2 for the definition) about a point $\mu$, then $D(x)$ decreases monotonically as $x$ moves away from $\mu$ along any fixed ray.

    All the concepts introduced so far can be easily extended to higher dimensions. For a distribution $F$ on $\mathbb{R}^p$, a random triangle in the definition of SD is now replaced by a random simplex whose vertices are $p + 1$ independent observations from $F$. Consequently, we define:

    1. the simplicial depth (SD) function $D(\cdot)$ on $\mathbb{R}^p$ with respect to $F$ to be

    (1.6)  $D(x) = P_F\bigl(x \in S[X_1, \dots, X_{p+1}]\bigr),$

    where $X_1, \dots, X_{p+1}$ are independent observations from $F$ and $S[X_1, \dots, X_{p+1}]$ is the simplex with vertices $X_1, \dots, X_{p+1}$ (in other words, $S[X_1, \dots, X_{p+1}]$ is the set of all points in $\mathbb{R}^p$ which are convex combinations of $X_1, \dots, X_{p+1}$);

    2. a (multivariate) simplicial median of $F$, $\mu$, to be a point which maximizes $D(\cdot)$;

    3. the sample simplicial depth function $D_n(\cdot)$ to be

    (1.7)  $D_n(x) = \binom{n}{p+1}^{-1} \sum_{1 \le i_1 < \cdots < i_{p+1} \le n} I\bigl(x \in S[X_{i_1}, \dots, X_{i_{p+1}}]\bigr)$


    [...] L-statistics is

    (1.10)  $L_w = \sum_{i=1}^{n} X_{[i]}\, w(i/n) \Big/ \sum_{j=1}^{n} w(j/n).$

    When $w(t) = I(t \le 1/n)$, $L_w$ is the same as the sample median $\hat\mu_n$ if $D_n(\cdot)$ is uniquely maximized among the sample points. When $w(t) = I(t \le 1 - \alpha)$, $L_w$ is a $100\alpha\%$ trimmed mean. In practice, $\alpha = 0.05$ or $0.1$ are the commonly used values. We would like to mention that the trimmed mean (with $\alpha = 0.95$ or so) should be an appealing alternative to $\hat\mu_n$ when the population SD is not uniquely maximized.
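The weighted average in (1.10) can be sketched in Python for bivariate data. This is a minimal illustration, assuming that $X_{[i]}$ denotes the data point with the $i$-th largest sample simplicial depth; the function names, the tie-breaking by input order, and the five data points are hypothetical choices made for this sketch only.

```python
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of the triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def sample_depth_2d(x, data):
    """Fraction of closed data triangles containing x (no three points collinear)."""
    hits = total = 0
    for a, b, c in combinations(data, 3):
        s = (_orient(a, b, x), _orient(b, c, x), _orient(c, a, x))
        total += 1
        hits += not (min(s) < 0 < max(s))
    return hits / total


def depth_l_statistic(data, w):
    """L_w of (1.10), with the data ranked by decreasing sample depth.

    Assumption of this sketch: ties in depth keep their input order.
    """
    n = len(data)
    ranked = sorted(data, key=lambda pt: sample_depth_2d(pt, data), reverse=True)
    weights = [w(i / n) for i in range(1, n + 1)]
    total = sum(weights)
    return tuple(
        sum(wt * pt[k] for wt, pt in zip(weights, ranked)) / total
        for k in range(2)
    )


# Four hull points and one interior point (illustrative data only).
data = [(0, 0), (10, 1), (9, 10), (-1, 8), (4, 5)]
n = len(data)

# w(t) = I(t <= 1/n) keeps only the deepest point: the sample simplicial median.
median = depth_l_statistic(data, lambda t: float(t <= 1 / n))

# w(t) = I(t <= 1 - alpha) averages the deepest 100(1 - alpha)% of the points.
trimmed = depth_l_statistic(data, lambda t: float(t <= 1 - 0.25))

print(median, trimmed)
```

In this small data set the interior point (4, 5) lies in 8 of the 10 triangles and each hull point in 6, so the median weight function returns the interior point. With so many ties among the hull points, the trimmed mean depends on the tie-breaking rule, which is one reason to treat the sketch as illustrative only.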

    (B) Directional data and simplicial depth. A direction in the plane can be viewed as a point on a unit circle, while a direction in three-dimensional space can be similarly viewed as a point on a unit sphere. The study of directional data leads to situations where the ambient space is not a $p$-dimensional Euclidean space, but rather a sphere in $(p - 1)$ dimensions. The notion of simplicial depth can be adapted by using geodesic simplices instead of simplices. For example, in the case of a circle, the short arc connecting two observations is to replace the random line segment used to define SD in $\mathbb{R}^1$. This is investigated in Liu and Singh (1988).

    (C) Testing the center of (angular) symmetry. We are often required to determine the center of a symmetric population. A class of distributions somewhat broader than the usual symmetric distributions is the class of angularly symmetric distributions. Roughly speaking, a distribution is angularly symmetric about a point $x$ if every hyperplane passing through $x$ divides the whole space into two half-spaces with equal probabilities. (For the precise definition and further discussions, see Section 2.) In the present paper, it is shown (cf. Theorems 3 and 4) that the SD is maximized at the center of angular symmetry and takes there the value $2^{-p}$ in $\mathbb{R}^p$. Thus, if $b_0$ is a hypothesized center of angular symmetry, then a large value of $(2^{-p} - D_n(b_0))$ is an indication of the null hypothesis being false. The observation in Remark B of Section 2 says that the test statistic $(2^{-p} - D_n(b_0))$ is a degenerate U-statistic. This fact leads us to conclude that $n(2^{-p} - D_n(b_0))$ has as its weak limit a linear combination of $\chi^2$-distributions [cf. Gregory (1977)]. A detailed study of this testing procedure will appear separately.

    Other applications of SD include deriving a class of multivariate scales and a multivariate classification rule. In fact, a measure of scale can be derived by considering how far away one has to move from the center (i.e., the maximum point of the sample SD) in order to reduce the SD value to a fraction of its maximum. As for classification, the idea here is roughly the following [see Gross and Liu (1988)]: Suppose that two training samples from two different populations are given. A classification rule is a way of assigning any new data point $Z$ to one of these two populations. Such a rule can be obtained by comparing the relative center-outward ranks of $Z$ w.r.t. the training samples. $Z$ should be assigned to the population whose training sample leads to a smaller relative rank for $Z$.
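The classification idea can be sketched in Python for bivariate data. The paper does not spell out how the relative rank is computed, so the definition below is an assumption of this sketch: the relative rank of $Z$ in a training sample is the fraction of training points whose sample depth, within that sample, is strictly larger than the sample depth of $Z$. The function names, the tie rule, and the two polygon-shaped training samples are likewise hypothetical.

```python
import math
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of the triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def sample_depth_2d(x, data):
    """Fraction of closed data triangles containing x (no three points collinear)."""
    hits = total = 0
    for a, b, c in combinations(data, 3):
        s = (_orient(a, b, x), _orient(b, c, x), _orient(c, a, x))
        total += 1
        hits += not (min(s) < 0 < max(s))
    return hits / total


def relative_rank(z, sample):
    """Assumed definition: share of training points strictly deeper than z.

    0 means z is at least as deep as every training point;
    1 means every training point is deeper than z.
    """
    dz = sample_depth_2d(z, sample)
    deeper = sum(sample_depth_2d(pt, sample) > dz for pt in sample)
    return deeper / len(sample)


def classify(z, sample_a, sample_b):
    """Assign z to the sample giving the smaller relative rank (ties go to "A")."""
    if relative_rank(z, sample_a) <= relative_rank(z, sample_b):
        return "A"
    return "B"


def ring(cx, cy, n=15):
    """Illustrative training sample: vertices of a regular n-gon around (cx, cy)."""
    return [
        (cx + math.cos(2 * math.pi * k / n), cy + math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


sample_a = ring(0.0, 0.0)
sample_b = ring(100.0, 100.0)
print(classify((0.0, 0.0), sample_a, sample_b))
print(classify((100.0, 100.0), sample_a, sample_b))
```

With these two well-separated samples, a new point at the center of one ring is deeper than every point of that ring and has depth 0 with respect to the other, so it is assigned to the ring that surrounds it. For small samples the comparison is biased, because each data point is automatically contained in every triangle of which it is a vertex.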

    General remarks. An earlier concept of data depth was introduced by Tukey (1975). Tukey's data depth and the related sample median studied in Stahel (1981), Donoho (1982), and Donoho and Gasko (1988) are based on the inspection of every one-dimensional projection of the sample data. In a different direction, Oja (1983) defined a sample median in $\mathbb{R}^p$ as a point $x$ which yields the minimum total volume of all simplices formed by $x$ and $p$ of the data points. As far as the generalized multivariate median is concerned, there is an extensive literature and a thorough coverage can be found in Rousseeuw and Leroy (1987).

    2. Main properties of the simplicial depth function $D(\cdot)$. The main properties of $D(\cdot)$ are summarized in Theorems 1-4.

    THEOREM 1. For any $F$ on $\mathbb{R}^p$ and $x \in \mathbb{R}^p$, $\sup_{\|x\| \ge M} D(x) \to 0$ as $M \to \infty$.

    THEOREM 2 [Continuity of $D(\cdot)$]. If $F$ is an absolutely continuous distribution on $\mathbb{R}^p$, then $D(\cdot)$ is continuous.

    The next two theorems are stated for angularly symmetric distributions. The reason we focus on these distributions is that they form a large class of distributions possessing an obvious center, and we shall show that this center agrees with the one predicted by the simplicial depth function.

    DEFINITION. A random variable $X$ in $\mathbb{R}^p$ or its distribution $F$ is said to be angularly symmetric about the point $b$ (in $\mathbb{R}^p$) if and only if the random variables $(X - b)/\|X - b\|$ and $-(X - b)/\|X - b\|$ are identically distributed, where $\|\cdot\|$ stands for the Euclidean norm.

    For $p = 2$, $F$ is angularly symmetric about $b$ simply means $a_b(\theta) = a_b(\theta + \pi)$ for all $\theta$, $0 \le \theta \le \pi$, where $a_b(\cdot)$ is the angular density around the point $b$ induced by $F$, provided that such angular density exists. It is easy to see that if $F$ is symmetric about $b$, then $F$ is angularly symmetric about $b$. It is also easy to see that if $F$ is angularly symmetric about $b$, then any hyperplane passing through $b$ will divide $\mathbb{R}^p$ into two open half-spaces with equal probabilities. This probability is $1/2$ if the distribution is absolutely continuous. Thus the center of angular symmetry is what one would want as a (multivariate) median. In view of this and Theorem 3, it is only natural to define a median by the maximal point of $D(\cdot)$. Finally, we note that the center of angular symmetry is unique when it exists, except in the case when the distribution $F$ has its whole probability mass concentrated on a line and its probability distribution along that line has more than one median. In fact, if $b_1$ and $b_2$ are two different centers of angular symmetry, then the region between any two parallel hyperplanes passing through $b_1$ and $b_2$, respectively, would have zero probability. Rotating these two hyperplanes, it follows that the entire $\mathbb{R}^p$ except for the line passing through $b_1$ and $b_2$ has zero probability.

    THEOREM 3 [Monotonicity of $D(\cdot)$]. If $F$ is absolutely continuous and angularly symmetric about the origin, then $D(\alpha x)$ is monotone nonincreasing in $\alpha \ge 0$ for all $x \in \mathbb{R}^p$.

    THEOREM 4. If $F$ is an absolutely continuous distribution on $\mathbb{R}^p$ and it is angularly symmetric about $b \in \mathbb{R}^p$, then $D(b) = 2^{-p}$.

    In particular, Theorems 3 and 4 imply that for any point $a$ in $\mathbb{R}^p$, $D(a) \le 2^{-p}$ if $F$ is an angularly symmetric distribution.

    Before discussing the proofs of Theorems 1-4, we pause to make two remarks.

    REMARK A. Theorem 3 is equivalent to saying that the contours defined by $\{x \in \mathbb{R}^p: D(x) = c\}$ for positive numbers $c < 2^{-p}$ are nested within one another. As $c$ decreases, they move further and further away from the center. Their geometry should contain useful information about the distribution $F$. In the special case when $F$ is spherical, each contour is a circle and $D(x)$ is a monotonic function of $\|x\|$. In the case of an elliptical distribution, i.e., when the density at $x$ is a function of $(x - \mu)'V^{-1}(x - \mu)$, it is not hard to show that $D(x)$ is also a function of $(x - \mu)'V^{-1}(x - \mu)$. In other words, the contours of $D(\cdot)$ resemble the contours of the underlying density in the elliptical case. This observation again confirms that $D(\cdot)$ indeed provides us with an appropriate notion of ordering.

    REMARK B. The proof of Theorem 4 will further imply the following fact: Under the assumption of angular symmetry at a center $b_0$, the conditional SD value at $b_0$ given one of the random vertices is the same as the unconditional one. In other words,

    (2.1)  $P\bigl(b_0 \in S[X_1, \dots, X_{p+1}] \mid X_i\bigr) = 2^{-p}$

    for each $i = 1, \dots, p + 1$. Evidently, (2.1) implies that $(2^{-p} - D_n(b_0))$ is a degenerate U-statistic, that is, $E[(D_n(b_0) - 2^{-p}) \mid X_i] = 0$ for all $i$, $1 \le i \le n$.

    PROOF OF THEOREM 1. Given $x$ in $\mathbb{R}^p$, we observe that the event $\{x \in S[X_1, \dots, X_{p+1}]\}$ is contained in the event $\bigcup_{i=1}^{p+1} \{\|X_i\| \ge \|x\|\}$. The theorem follows.

    PROOF OF THEOREM 2. Let $p = 2$ for simplicity. Let $x$ and $y$ be two distinct points. A random triangle can contribute to the difference $D(x) - D(y)$ only if it contains one point but not the other. This however implies that there must be a line segment joining two data points which intersects the line segment $\overline{xy}$. It follows that if $x_n$ is a sequence in $\mathbb{R}^2$ which converges to $x$, then

    $|D(x) - D(x_n)| \le 3P(A_n),$


    where $A_n = \{(X_1, X_2): \overline{X_1 X_2} \text{ intersects } \overline{x x_n}\}$. Note that $\limsup_{n \to \infty} P(A_n)$ [...] $0$ for any $\alpha > 1$ if the following two additional conditions hold: (i) $f$ is positive in a neighborhood of the origin, and (ii) $f$ is positive in a neighborhood of $\beta x$ for some $\beta$ such that $1$ [...]


    [...] is the $i$th unit vector in $\mathbb{R}^p$ and $[[X_1^*, \dots, X_p^*]]$ is the matrix with columns $X_1^*, \dots, X_p^*$.

    (iv) $\{(X_1, \dots, X_p, X_{p+1}): W_1 \ge 0, \dots, W_p \ge 0\}$, where $W_i$ is the $i$th component of the vector $[[X_1^*, \dots, X_p^*]]^{-1} X_{p+1}^*$.

    By exchanging $X_i^*$ with $-X_i^*$, we can show that the random vector $[[X_1^*, \dots, X_p^*]]^{-1} X_{p+1}^*$ is coordinatewise symmetric about the origin. This implies that each orthant determined by $(W_1, \dots, W_p)'$ has an equal probability, which must be $2^{-p}$. Therefore the event (iv) has the probability $2^{-p}$ and Theorem 4 follows. □
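Theorem 4 can be checked numerically in the plane, where it predicts $D(b) = 2^{-2} = 1/4$. The Python sketch below is a Monte Carlo illustration under stated assumptions, not part of the paper's argument. It uses a hypothetical distribution that is angularly symmetric about the origin without being symmetric: the direction is uniform on the circle, while the radius is five times larger on average in the upper half-plane. The membership test follows the spirit of the orthant argument: the origin lies in the closed triangle with vertices $X_1, X_2, X_3$ exactly when the solution $W$ of $[X_1\ X_2]\,W = -X_3$ has nonnegative coordinates; this sign convention is the sketch's own.

```python
import math
import random


def origin_in_triangle(x1, x2, x3):
    """True if 0 is a convex combination of x1, x2, x3 (bivariate case).

    Solves [x1 x2] w = -x3 by Cramer's rule and checks w >= 0 coordinatewise.
    Degenerate configurations (x1, x2 linearly dependent) are treated as misses.
    """
    det = x1[0] * x2[1] - x1[1] * x2[0]
    if det == 0:
        return False
    w1 = (-x3[0] * x2[1] + x3[1] * x2[0]) / det
    w2 = (-x1[0] * x3[1] + x1[1] * x3[0]) / det
    return w1 >= 0 and w2 >= 0


def draw_angularly_symmetric(rng):
    """Hypothetical example distribution: uniform direction, asymmetric radius."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    scale = 5.0 if theta < math.pi else 1.0  # longer radii in the upper half-plane
    r = scale * rng.expovariate(1.0)
    return (r * math.cos(theta), r * math.sin(theta))


def estimate_center_depth(n_sim, seed=0):
    """Monte Carlo estimate of D(0) = P(0 in triangle of three i.i.d. draws)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        triangle = [draw_angularly_symmetric(rng) for _ in range(3)]
        hits += origin_in_triangle(*triangle)
    return hits / n_sim


print(estimate_center_depth(20000, seed=1))  # should be close to 2 ** -2
```

Whether the origin lies in the triangle depends only on the directions of the three vertices, which is why the unequal radii do not move the estimate away from $1/4$.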

    3. Consistency of the sample simplicial depth $D_n(\cdot)$.

    THEOREM 5. Let $F$ be an absolutely continuous distribution on $\mathbb{R}^p$ with bounded density $f$. Then:

    (a) The uniform consistency of $D_n(\cdot)$ holds, i.e.,

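    $\sup_{x \in \mathbb{R}^p} |D_n(x) - D(x)| \to 0$ a.s. as $n \to \infty$.

    (b) Furthermore, if $f$ does not vanish in a neighborhood of $\mu$ and if $D(\cdot)$ is uniquely maximized at $\mu$, then $\hat\mu_n \to \mu$ a.s. as $n \to \infty$.

    The proof of Theorem 5 is based on the following three lemmas.

    LEMMA 1. For any $F$ on $\mathbb{R}^p$ and $x \in \mathbb{R}^p$,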

    $\sup_{\|x\| \ge M} D_n(x) \to 0$ a.s. as $M \to \infty$.

    LEMMA 2. Suppose that $F$ is absolutely continuous. Let $\delta$ and $c$ be arbitrary but fixed positive constants. Then, for any positive $\varepsilon$, we have

    $\sup_{\{x,\, y \in \mathrm{Ball}(\mu, c),\ \|x - y\| \,[\dots]\}} |D_n(x) - D_n(y)| \le \gamma(\varepsilon) + \delta + R_n,$ [...]


    [...] Glivenko-Cantelli class if $F$ has a density w.r.t. Lebesgue measure [cf. Gaenssler and Stute (1979)]. In other words,

    $\sup_{A \in \mathscr{C}} |F_n(A) - F(A)| \to 0$ a.s.,

    where $\mathscr{C}$ is the class of all convex Borel measurable sets. We refer to Liu (1987) for the details. Lemma 3 is needed because $D_n(x)$ is a U-statistic. In fact, Lemma 3 is essentially Lemma A on page 185 of Serfling (1980). Finally, we come to the proof of Theorem 5.

    PROOF OF THEOREM 5. By Theorem 1 and Lemma 1, part (a) will follow if we can show that, for a chosen $M > 0$,

    (3.1)  $\sup_{x \in Q(0, M)} |D_n(x) - D(x)| \to 0$ a.s. as $n \to \infty$,

    where $Q(\mu, M)$ is the hypercube with $\mu$ as its center and $M$ as the length of its sides. Divide each side of $Q(\mu, M)$ into $N$ equal pieces to form $N^p$ subhypercubes. In view of Theorem 2 and Lemma 2, since $N$ can be arbitrarily large, we only need to show that

    (3.2)  $\max_{x \in C(\mu, M)} |D_n(x) - D(x)| \to 0$ a.s. as $n \to \infty$,

    where $C(\mu, M)$ is the set of all corner points of the subhypercubes.

    Using Lemma 3 with $m = p + 1$, $r = 4$ and $c = 1$, we obtain

    $P\bigl(\max_{x \in C(\mu, M)} |D_n(x) - D(x)| > \varepsilon\bigr) \le N^p \max_{x \in C(\mu, M)} P\bigl(|D_n(x) - D(x)| > \varepsilon\bigr) = O(n^{-2}).$

    The claim (3.2) therefore follows from the Borel-Cantelli lemma.

    The idea of the proof of part (b) can be outlined as follows: We begin with two balls, each centered at $\mu$. The radius of the bigger ball is arbitrarily small but fixed. Then it is shown that the $D_n$ value at any point inside the inner ball is larger than that at any point outside the bigger ball, for all large $n$. Since, for all large $n$, at least one data point will fall inside the inner ball, the possibility of $\hat\mu_n$ lying outside the bigger ball is ruled out.

    Assuming that $D(\cdot)$ is uniquely maximized at $\mu$, we see that, for any $\varepsilon > 0$, there exists $\delta > 0$ such that $D(x) < D(\mu) - \delta$ for all $x \notin \mathrm{Ball}(\mu, \varepsilon)$. By the continuity of $D(\cdot)$ (cf. Theorem 2), we may choose $\varepsilon_1 < \varepsilon$ such that $|D(y) - D(\mu)| < \delta/2$ for all $y \in \mathrm{Ball}(\mu, \varepsilon_1)$. Thus, $D(x) < D(y) - \delta/2$ for all $x \notin \mathrm{Ball}(\mu, \varepsilon)$ and $y \in \mathrm{Ball}(\mu, \varepsilon_1)$. The uniform convergence of $D_n$ to $D$ given in part (a) of Theorem 5 guarantees that, starting from a certain $n$, $D_n(x) < D_n(y) - \delta/4$ for all $x \notin \mathrm{Ball}(\mu, \varepsilon)$ and $y \in \mathrm{Ball}(\mu, \varepsilon_1)$.

    Now we claim that there is at least one sample point inside the smaller ball $\mathrm{Ball}(\mu, \varepsilon_1)$ for all large $n$, almost surely. Since $f$ does not vanish in a neighborhood of $\mu$, we have

    $p = P\bigl(X_1 \in \mathrm{Ball}(\mu, \varepsilon_1)\bigr) > 0.$

    Consequently,

    $P\bigl(\mathrm{Ball}(\mu, \varepsilon_1) \text{ does not contain any of } X_1, \dots, X_n\bigr) = (1 - p)^n.$

    Therefore, almost surely, after a certain $n$, there exists some sample point, say $X_k$, inside $\mathrm{Ball}(\mu, \varepsilon_1)$. By the definition of $\hat\mu_n$, $D_n(\hat\mu_n) \ge D_n(X_k)$ and, hence, $\hat\mu_n \in \mathrm{Ball}(\mu, \varepsilon)$. Since $\varepsilon$ can be chosen arbitrarily small, part (b) follows. □

    Acknowledgments. I would like to thank Joop Kemperman for his encouragement and many helpful discussions. I am grateful to the referee and the Editor for their kind suggestions, which helped improve the article greatly.

    REFERENCES

    Donoho, D. L. (1982). Breakdown properties of multivariate location estimators. Qualifying paper, Harvard Univ.
    Donoho, D. and Gasko, M. (1988). Multivariate generalization of the median and trimmed means. I. Unpublished.
    Gaenssler, P. and Stute, W. (1979). Empirical processes: A survey of results for independent and identically distributed random variables. Ann. Probab. 7 193-243.
    Gregory, G. G. (1977). Large sample theory for U-statistics and tests of fit. Ann. Statist. 5 110-123.
    Gross, S. and Liu, R. (1988). Classification rules based on concepts of data depth. Unpublished.
    Liu, R. (1987). Simplicial depth and the related location estimators. Technical report, Dept. Statist., Rutgers Univ.
    Liu, R. and Singh, K. (1988). On the ordering of directional data: Concepts of data depth on circles and spheres. Unpublished.
    Oja, H. (1983). Descriptive statistics for multivariate distribution. Statist. Probab. Lett. 1 327-333.
    Rousseeuw, P. J. and Leroy, A. (1987). Robust Regression and Outlier Detection. Wiley, New York.
    Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
    Stahel, W. A. (1981). Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. Ph.D. dissertation, ETH, Zürich.
    Tukey, J. W. (1975). Mathematics and picturing data. Proceedings of International Congress of Mathematics, Vancouver, 523-531.

    Department of Statistics
    Rutgers University
    New Brunswick, New Jersey 08903

