A Convex Hull Peeling Depth Approach to Nonparametric Massive Multivariate Data Analysis with Applications
Hyunsook Lee
Department of Statistics
The Pennsylvania State University
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 1
Outline
• Convex Hull Peeling (CHP) and Multivariate Data Analysis
    - Definitions for CHP
    - Data Depth (Ordering Multivariate Data)
    - Quantiles and Density Estimation
• Color Magnitude (CM) Diagram and Sloan Digital Sky Survey
• Nonparametric Descriptive Statistics with CHP
    - Multivariate Median
    - Skewness and Kurtosis of a Multivariate Distribution
• Outlier Detection with CHP
    - Level α; Shape Distortion; Balloon Plot
• Concluding Remarks
Definitions
Convex Set: A set C ⊆ R^d is convex if, for every two points x, y ∈ C, the whole segment xy is also contained in C.
Convex Hull: The convex hull of a set of points X in R^d, denoted CH(X), is the intersection of all convex sets in R^d containing X. In algorithms, "convex hull" usually refers to the vertices (extreme points) of CH(X), the minimal subset of points that determines its shape; connecting these points produces a wrap of CH(X).
[Figure: two x-y scatter panels, both axes from −2 to 2.]
Convex Hull Peeling
[Figure: two x-y scatter panels, both axes from −2 to 2, labeled "Before" and "After".]
Convex Hull Peeling Depth (CHPD)
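CHPD: For a point $x \in \mathbb{R}^d$ and the data set $X = \{X_1, \ldots, X_{N-1}\}$, let $C_1 = CH\{x, X\}$ and denote the set of its vertices by $V_1$. Convex hull peeling gives $C_j = C_{j-1} \setminus V_{j-1}$ for $j = 2, \ldots$, continuing until $x \in V_j$. Then

$$\mathrm{CHPD}(x) = \frac{\#\left(\cup_{i=1}^{k} V_i\right)}{N} \quad \text{for } k = \min_j \{\, j : x \in V_j \,\};$$

otherwise CHPD is 0.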
• Tukey (1974): Locating the data center (median) by the convex hull peeling process.
• Barnett (1976): Ordering based on depth.
• pth quantiles are (1 − p)th CHPDs.
• Hyper-polygons of (1 − p)th depth are obtainable from data of any dimension.
• QHULL (Barber et al., 1996) works for general dimensions (http://qhull.org).
• Why CHPD...
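The CHPD definition above can be sketched for the bivariate case as follows. This is a minimal illustration, not the QHULL-based implementation mentioned in the slides: it uses Andrew's monotone chain for the 2-D hull, assumes distinct points given as tuples, and the helper names (`hull_vertices`, `peel_layers`, `chpd`) are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Extreme points of a 2-D point set, counter-clockwise (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel_layers(points):
    """Vertex sets V_1, V_2, ... obtained by repeatedly removing hull vertices."""
    remaining = set(points)
    layers = []
    while remaining:
        layer = hull_vertices(remaining)
        layers.append(layer)
        remaining.difference_update(layer)
    return layers


def chpd(x, data):
    """CHPD(x) = #(V_1 u ... u V_k) / N, with k the first peel containing x."""
    pts = set(data) | {x}          # N points: the data plus x itself
    peeled = 0
    for layer in peel_layers(pts):
        peeled += len(layer)
        if x in layer:
            return peeled / len(pts)
    return 0.0                     # mirrors the "otherwise CHPD is 0" clause


# Toy data (assumed for illustration): two nested squares.
data = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3)]
print(chpd((2, 2), data), chpd((0, 0), data))
```

Under this convention the outermost peel has the smallest depth and the innermost point(s) the largest, matching the later statement that points of smallest depth are possible outliers.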
Challenges in Nonparametric Multivariate Analysis
How to Order Multivariate Data?
Ordering Multivariate Data → Data Depth
• Mahalanobis Depth: Mahalanobis (1936)
• Convex Hull Peeling Depth: Barnett (1976)
• Half Space Depth: Tukey (1975)
• Simplicial Depth: Liu (1990)
• Oja Depth: Oja (1983)
• Majority Depth: Singh (1991)
• Ordering is not uniformly defined
Statistical Data Depth (Zuo and Serfling, 2000)
(P1) (Affine invariance) $D(Ax + b; F_{AX+b}) = D(x; F_X)$ holds for any random vector $X$ in $\mathbb{R}^d$, any $d \times d$ nonsingular matrix $A$, and any $d$-vector $b$;
(P2) (Maximality at center) $D(\theta; F) = \sup_{x \in \mathbb{R}^d} D(x; F)$ holds for any $F \in \mathcal{F}$ having center $\theta$;
(P3) (Monotonicity) for any $F \in \mathcal{F}$ having deepest point $\theta$, $D(x; F) \le D(\theta + \alpha(x - \theta); F)$ holds for $\alpha \in [0, 1]$; and
(P4) (Vanishing at infinity) $D(x; F) \to 0$ as $\|x\| \to \infty$, for each $F \in \mathcal{F}$.
Convex Hull Peeling Depth
• affine invariance
• maximality at center
• monotonicity relative to deepest point
• vanishing at infinity
CHPD has these properties, and points of smallest depth are possible outliers.
Quantile Estimation
• Median: the point(s) left after peeling (the robustness of this estimator is shown later)
• pth Quantile: the level set whose central region contains ∼100p% of the data (the level set and the central region are defined later)
• No closed form; empirical process
Empirical Density Estimation
Density estimation with CHPD on bivariate normal data (McDermott, 2003)
Sample of 100,000 bivariate normal points; quantiles = {0.99, 0.95, 0.90, 0.80, ..., 0.20, 0.10, 0.05, 0.01}
[Figure: left panel, x-y plot with both axes from −4 to 4; right panel, density surface over x and y, density axis from 0.00 to 0.20.]
Lessons and Further Studies
• The sample should come from a convex distribution (no doughnut shape)
• Works on massive data → sequential method
• Without previous knowledge, no model or prior is available to start an analysis → exploratory data analysis for a large database
• Nonparametric and non-distance-based approach
• Where can CHP be applied, and how? → Multi-color diagrams from astronomy, where a plethora of free data archives is available.
Color Magnitude Diagram
Two-dimensional color-color diagram, or the celebrated Hertzsprung-Russell diagram (switch)
What if we could see beyond two dimensions without projection bias? Then three- or higher-dimensional color diagrams might become popular.
CHP may assist in analyzing multi-color diagrams. This needs a suitable data set with colors.
Sloan Digital Sky Survey: SDSS
Commissioned in 2000; Data Release 5 is now available.
5 bands; 4 variables (u-g, g-r, r-i, i-z)
• Studies analyzing massive astronomical data have recently been in the spotlight. http://www.sdss.org
• July 2005: Data Release Four
    - 6670 square degrees, 180 million objects
    - Available from http://www.sdss.org/dr4
    - Retrieved from SpecPhotoAll with SQL
• Attributes of the photometric data are the band magnitudes u, g, r, i, z (from which the color indices are formed), along with coordinates.
SQL for SDSS
    select ra, dec, z, psfMag_u, psfMag_g, psfMag_r,
           psfMag_i, psfMag_z
    from SpecPhotoAll
    where specclass = 2

• Note: specclass 2 = galaxies, 3 = QSO, 4 = HighZ QSO
• Galaxies: 499043
• Quasars: 70204
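As a small illustration of how the four analysis variables relate to the queried columns, the sketch below forms the color indices (u-g, g-r, r-i, i-z) as differences of adjacent band magnitudes. It assumes the colors are built from the `psfMag_*` values returned by the query; the function name and the sample magnitudes are hypothetical. The bare `z` column in the query is presumably the redshift, not a band, and is not used here.

```python
def color_indices(psf_u, psf_g, psf_r, psf_i, psf_z):
    """Return the four colors (u-g, g-r, r-i, i-z) from five band magnitudes."""
    return (psf_u - psf_g, psf_g - psf_r, psf_r - psf_i, psf_i - psf_z)


# Hypothetical magnitudes for one object, not taken from SDSS.
row = {"psfMag_u": 19.0, "psfMag_g": 18.0, "psfMag_r": 17.5,
       "psfMag_i": 17.25, "psfMag_z": 17.0}
colors = color_indices(row["psfMag_u"], row["psfMag_g"], row["psfMag_r"],
                       row["psfMag_i"], row["psfMag_z"])
print(colors)
```

Each object then becomes one point in a four-dimensional color space, which is the input to the peeling procedures.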
Multivariate Descriptive Statistics
• CHP Median
• CHP Skewness
• CHP Kurtosis
with bivariate simulated data and SDSS DR4
Convex Hull Peeling Median (CHPM)
Multivariate Median: the innermost point among the data → Survey of Multivariate Median (Small, 1990)
CHPM: recursive peeling leads to the innermost point(s). The average of these largest-depth points is the median of a data set.

Simulations: samples from the standard bivariate normal distribution

n       mean                    median                   CHPM
10^4    (0.001338, -0.02232)    (-0.005305, -0.01643)    (0.000918, -0.010589)
10^6    (0.000072, 0.000114)    (0.001185, -0.000717)    (0.002455, -0.000456)

Sequential CHPM → (0.004741, -0.004111)
Setting for the sequential method: m = 10000 and d = 0.05
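The CHPM described above (average of the points left in the last peel) can be sketched in two dimensions as follows. This is an illustrative batch version only; the sequential method with settings m and d is not reproduced here because the slides do not spell it out. The helper names are hypothetical, and distinct points given as tuples are assumed.

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Extreme points of a 2-D point set (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def chpm(points):
    """Convex hull peeling median: mean of the innermost (last) peel."""
    remaining = set(points)
    if not remaining:
        raise ValueError("chpm needs at least one point")
    layer = []
    while remaining:
        layer = hull_vertices(remaining)   # current outermost vertices
        remaining.difference_update(layer)
    k = len(layer)                         # the last layer peeled is the deepest
    return (sum(p[0] for p in layer) / k, sum(p[1] for p in layer) / k)


# Toy data (assumed): a square with two interior points.
print(chpm([(0, 0), (4, 0), (4, 4), (0, 4), (1, 2), (3, 2)]))
```

When the last peel contains several points, the sketch averages them, as the slide text prescribes.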
Application: Median
Quasars      u-g      g-r      r-i      i-z
Mean         0.4619   0.2484   0.1649   0.1008
Median       0.2520   0.1750   0.1520   0.0770
CHPM         0.2530   0.1640   0.1913   0.0700

Galaxies     u-g      g-r      r-i      i-z
Mean         1.622    0.9211   0.4226   0.3439
Median       1.680    0.8930   0.4200   0.3540
CHPM         1.790    0.957    0.424    0.367
Seq. CHPM    1.772    0.950    0.4228   0.363
Robustness of Convex Hull Peeling Median
The breakdown point of a convex hull peeling median goes to zero as n → ∞ (Donoho, 1982). Outliers are necessarily located at infinity.

Empirical mean square error (EMSE) and relative efficiency (RE):
Model: $(1 - \epsilon)\,N((0, 0), I) + \epsilon\,N(\cdot, 4I)$
n = 5000, m = 500, $T_j$ = (CHPM, Mean)

$$\mathrm{EMSE} = \frac{1}{m} \sum_{i=1}^{m} \|T_j - \mu\|^2$$

         N((5, 5)^t, 4I)                       N((10, 10)^t, 4I)
ε        CHPM        Mean       RE            CHPM        Mean       RE
0        0.002178    0.000417   0.191689      0.002178    0.000417   0.191689
0.005    0.0028521   0.001682   0.589961      0.002891    0.005444   1.88291
0.05     0.016842    0.125522   7.45262       0.017824    0.500610   28.08597
0.2      0.139215    2.00109    14.37612      0.1435910   8.0017     55.7264
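The RE column in the table is consistent with the ratio EMSE(Mean) / EMSE(CHPM), so values above 1 favor the CHPM. The slides do not state this formula explicitly; it is inferred from the tabulated numbers. The sketch below spells out the EMSE formula and rechecks the ratio from the table entries. The function names are hypothetical.

```python
def emse(estimates, mu):
    """(1/m) * sum of squared Euclidean distances between estimates and mu."""
    m = len(estimates)
    return sum(sum((t - c) ** 2 for t, c in zip(est, mu)) for est in estimates) / m


def relative_efficiency(emse_chpm, emse_mean):
    """Ratio inferred from the table: EMSE of the mean over EMSE of the CHPM."""
    return emse_mean / emse_chpm


# (epsilon, EMSE CHPM, EMSE Mean, tabulated RE), contamination N((5, 5)^t, 4I).
table_5 = [
    (0.0, 0.002178, 0.000417, 0.191689),
    (0.005, 0.0028521, 0.001682, 0.589961),
    (0.05, 0.016842, 0.125522, 7.45262),
    (0.2, 0.139215, 2.00109, 14.37612),
]

for eps, e_chpm, e_mean, re_tab in table_5:
    re_calc = relative_efficiency(e_chpm, e_mean)
    print(eps, round(re_calc, 4), re_tab)
```

The recomputed ratios agree with the tabulated RE to within rounding of the printed EMSE values.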
Generalized Quantile Process
Einmahl and Mason (1992):

$$U_n(t) = \inf\{\lambda(A) : P_n(A) \ge t,\ A \in \mathcal{A}\}, \quad 0 < t < 1.$$

• Central Region: $R_{CH}(t) = \{x \in \mathbb{R}^d : \mathrm{CHPD}(x) \ge t\}$
• Level Set: $B_{CH}(t) = \partial R_{CH}(t) = \{x \in \mathbb{R}^d : \mathrm{CHPD}(x) = t\}$
• Volume Functional: $V_{CH}(t) = \mathrm{Volume}(R_{CH}(t))$
→ One-dimensional mapping.
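For bivariate data the volume functional reduces to the area of each peel's hull, so the one-dimensional mapping can be tabulated as (depth, area) pairs. The sketch below is an illustration under assumptions: 2-D points as distinct tuples, a monotone-chain hull, the shoelace formula for area, and depth taken as the cumulative fraction of peeled points as in the CHPD definition. All function names are hypothetical.

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Extreme points in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; zero for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice = sum(vertices[i][0] * vertices[(i + 1) % n][1]
                - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    return abs(twice) / 2.0


def volume_functional(points):
    """List of (depth, area) pairs, one per peel, from outermost to innermost."""
    remaining = set(points)
    total, peeled, out = len(remaining), 0, []
    while remaining:
        layer = hull_vertices(remaining)
        peeled += len(layer)
        out.append((peeled / total, polygon_area(layer)))
        remaining.difference_update(layer)
    return out


pts = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
for depth, area in volume_functional(pts):
    print(round(depth, 3), area)
```

In higher dimensions the area would be replaced by the hull volume reported by a general-dimension hull routine such as QHULL.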
[Figure: four x-y panels with differing axis ranges.]
→ The level sets are not equi-probability contours; smooth convex distributions are assumed.
Skewness Measure
Let $x_{j,i}$ be the $i$th vertex in the level set $B_{CH,j}$ formed by the $j$th peeling step. A measure of skewness:

$$R_j = \frac{\max_i \|x_{j,i} - \mathrm{CHPM}\| - \min_i \|x_{j,i} - \mathrm{CHPM}\|}{\min_i \|x_{j,i} - \mathrm{CHPM}\|}$$

The sequence $R_j$ both visualizes and quantifies skewness along depths.
The denominator is for regularization → affine invariant $R_j$
symmetric: flat $R_j$ along convex hull peels
skewed: fluctuating $R_j$
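The skewness sequence R_j can be sketched for bivariate data as below. This is a minimal illustration under assumptions: distinct 2-D tuples, a monotone-chain hull, CHPM taken as the mean of the innermost peel, and peels whose minimum distance to the CHPM is zero skipped to avoid division by zero (a choice made here, not stated in the slides). Function names are hypothetical.

```python
import math


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Extreme points of a 2-D point set (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel_layers(points):
    remaining, layers = set(points), []
    while remaining:
        layer = hull_vertices(remaining)
        layers.append(layer)
        remaining.difference_update(layer)
    return layers


def skewness_sequence(points):
    """R_j = (max_i d_ji - min_i d_ji) / min_i d_ji, d_ji = ||x_ji - CHPM||."""
    layers = peel_layers(points)
    core = layers[-1]
    cx = sum(p[0] for p in core) / len(core)
    cy = sum(p[1] for p in core) / len(core)
    seq = []
    for layer in layers:
        dists = [math.hypot(p[0] - cx, p[1] - cy) for p in layer]
        if min(dists) == 0.0:      # peel touches the CHPM itself: skip it
            continue
        seq.append((max(dists) - min(dists)) / min(dists))
    return seq


symmetric = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
skewed = [(-1, 0), (3, 0), (0, 1), (0, -1), (0, 0)]
print(skewness_sequence(symmetric), skewness_sequence(skewed))
```

On the symmetric toy set every peel gives R_j = 0, while the stretched set gives a positive value, in line with the "flat versus fluctuating" reading above.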
Simulation: Skewness Measure
[Figure: top row, scatter plots of samples labeled N(0, I), N(0, Σ), and "non normal" (axes labeled χ²₁₀ and N(0, 1)); bottom row, R_j against the jth convex hull level set (j from 0 to 80) for each sample.]
Application: Skewness Measure (Quasars)
[Figure: R_j against the jth convex hull level set (j from 0 to 80) for the quasar sample.]
Application: Skewness Measure (Galaxies)
[Figure: R_j against the jth convex hull level set (j from 0 to 200) for the galaxy sample.]
Kurtosis Measure
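Quantile (depth) based kurtosis:

$$K_{CH}(r) = \frac{V_{CH}\left(\tfrac{1}{2} - \tfrac{r}{2}\right) + V_{CH}\left(\tfrac{1}{2} + \tfrac{r}{2}\right) - 2\,V_{CH}\left(\tfrac{1}{2}\right)}{V_{CH}\left(\tfrac{1}{2} - \tfrac{r}{2}\right) - V_{CH}\left(\tfrac{1}{2} + \tfrac{r}{2}\right)}$$

Tailweight:

$$t(r, s) = \frac{V_{CH}(r)}{V_{CH}(s)}$$

for $0 < s < r \le 1$. Here, $V_{CH}(r)$ indicates the volume functional at depth $r$.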
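Both measures are simple functions of the volume functional, which the sketch below makes explicit. The volume functional is passed in as a callable; the toy choice `V(t) = (1 - t)^2` is purely hypothetical and stands in for values that would come from peeling real data. Function names are also hypothetical.

```python
def kurtosis_ch(volume, r):
    """K_CH(r) from a volume functional `volume(depth)`, for 0 < r < 1."""
    outer = volume(0.5 - r / 2.0)   # larger region (smaller depth)
    inner = volume(0.5 + r / 2.0)   # smaller region (larger depth)
    mid = volume(0.5)
    return (outer + inner - 2.0 * mid) / (outer - inner)


def tailweight(volume, r, s):
    """t(r, s) = V_CH(r) / V_CH(s) for 0 < s < r <= 1."""
    if not 0.0 < s < r <= 1.0:
        raise ValueError("need 0 < s < r <= 1")
    return volume(r) / volume(s)


def toy_volume(t):
    """Hypothetical volume functional, decreasing in depth."""
    return (1.0 - t) ** 2


print(kurtosis_ch(toy_volume, 0.5))
print(tailweight(toy_volume, 0.5, 0.25))
```

A volume functional that is linear in depth gives K_CH(r) = 0 for every r, since the numerator is a second difference around depth 1/2.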
Simulation: Kurtosis Measure (Tailweight)
[Figure: tailweight t(r, r_min) against depth r (axis running from 1.0 down to 0.0) for uniform, normal, t10, and Cauchy samples.]
Application: Kurtosis Measure (Quasars)
[Figure: v(r, r_min) against depth (axis running from 1.0 down to 0.0) for the quasar sample.]
Multivariate Outlier Detection
• What are Outliers?
• Detecting Algorithms
    - Level α
    - Shape Distortion
    - Balloon Plot
What are Outliers?
Outliers are...
• Cumbersome Observations
• Lead to New Scientific Discoveries
• Improve Models (Robust Statistics)
• ...
• No Clear Objectives but Come Along Often
CHP: experience and relative robustness support the idea of outlier detection.
⇒ We need a clear definition of outliers, especially outliers of the 21st century, as well as outlier detection methods.
Outliers are observations that...
• Huber (1972): are unlikely to belong to the main population.
• Barnett and Leroy (1994): appear inconsistent with the remainder.
• Hawkins (1980): deviate so much as to arouse suspicion.
• Beckman and Cook (1983): are surprising and discrepant to the investigator. Discordant observations or contaminants.
• Rohlf (1975): are somewhat isolated from the main cloud of points.
Yet, somewhat VAGUE!
Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ...
Multivariate: Mostly bivariate applications
• Generalized Gap Test (Rohlf, 1975)
• Bivariate Box Plot (Zani et al., 1999)
• Sunburst Plot (Liu et al., 1999)
• Bag plot (Miller et al., 2003)
and Mahalanobis distance $D(x) = (x - \hat{\mu})^{T}\,\hat{\Sigma}^{-1}(x - \hat{\mu})$.
Difficulties of multivariate analysis arise from the complexity of ordering multivariate data.
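For contrast with the depth-based methods, the Mahalanobis distance listed above can be sketched for bivariate data with plain arithmetic. This is an illustration under assumptions: the sample mean and the sample covariance (divisor n − 1) are used as the estimates, the 2 × 2 inverse is written out explicitly, and the function name is hypothetical. The value returned is the quadratic form as written above, that is, the squared distance.

```python
def mahalanobis_2d(x, sample):
    """Quadratic form (x - mean)^T S^{-1} (x - mean) for 2-D data."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    # Sample covariance entries with divisor n - 1.
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("singular covariance matrix")
    dx, dy = x[0] - mx, x[1] - my
    # Inverse of [[sxx, sxy], [sxy, syy]] applied to (dx, dy).
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


sample = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(mahalanobis_2d((3, 1), sample))
```

Unlike convex hull peeling, this measure depends on estimated moments, so a few extreme points can shift the mean and covariance it is built on.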
Quantile Based Outlier Detection
[Figure: x-y plots and density surfaces for two samples: bivariate standard normal (axes from −4 to 4) and bivariate t5 with ρ = −0.5 (axes from −20 to 20).]
Contour Shape Changes
[Figure: x-y panels titled "Bivariate Normal Sample" and "Outliers Added", each shown twice, followed by Vol_j against the jth convex hull level set (j from 0 to 80) for each case; a "GAP" is marked on a volume curve.]
Balloon Plot
[Figure: x-y panels titled "Bivariate Normal Sample" and "Outliers Added".]
A balloon plot is obtained by blowing up the .5th CHPD polyhedron by 1.5 times (lengthwise). Let $V_{.5}$ be the set of vertices of the .5th CHPD hull. The balloon for outlier detection is

$$B_{1.5} = \{\, y_i : y_i = x_i + 1.5\,(x_i - \mathrm{CHPM}),\ x_i \in V_{.5} \,\}.$$

In other words, blow the balloon of the IQR up 1.5 times larger.
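The balloon construction can be sketched for bivariate data as follows. This is an illustration under stated assumptions rather than the author's implementation: the .5th CHPD hull is taken to be the first peel whose cumulative peeled fraction reaches 0.5, the CHPM is the mean of the innermost peel, and points strictly outside the balloon polygon are flagged. All helper names are hypothetical.

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Extreme points in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def balloon(points, level=0.5, factor=1.5):
    """Vertices y = x + factor * (x - CHPM) for x on the `level` CHPD hull."""
    remaining, layers = set(points), []
    while remaining:
        layer = hull_vertices(remaining)
        layers.append(layer)
        remaining.difference_update(layer)
    core = layers[-1]
    cx = sum(p[0] for p in core) / len(core)
    cy = sum(p[1] for p in core) / len(core)
    total, peeled = len(set(points)), 0
    for layer in layers:
        peeled += len(layer)
        if peeled / total >= level:            # assumed choice of the .5 hull
            return [(x + factor * (x - cx), y + factor * (y - cy))
                    for x, y in layer]
    return []


def outside(polygon, p):
    """True if p lies strictly outside a counter-clockwise convex polygon."""
    n = len(polygon)
    return any(cross(polygon[i], polygon[(i + 1) % n], p) < 0 for i in range(n))


# Toy data (assumed): three nested squares, a center, and one far point.
data = [(sx * h, sy * h) for h in (1, 2, 3) for sx in (-1, 1) for sy in (-1, 1)]
data += [(0, 0), (10, 0)]
b = balloon(data)
print([p for p in data if outside(b, p)])
```

Because the balloon is a scaled copy of a convex polygon about an interior point, it stays convex with the same vertex order, which is what the simple edge-side test relies on.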
Outliers in Quasar Population
[Figure: Vol_j against the jth convex hull level set (j from 0 to 80) for the quasar sample; the second panel shows Vol_j^0.25.]
Volumes of the 1st CH, the .01 depth CH, and the .05 depth CH: (474.134, 14.442, 4.353)
Outliers in Galaxy Population
[Figure: Vol_j against the peel index j (from 0 to 200) for the galaxy sample; the second panel shows Vol_j^0.25.]
Volumes of the 1st CH, the .01 depth CH, and the .05 depth CH: (4919.492, 4.310, 1.075)
Discussion on CHP
Convex Hull Peeling is...
• a robust location estimator.
• a tool for descriptive statistics: skewness and kurtosis measures.
• a reasonable approach for detecting multivariate outliers.
• a starter for clustering.
⇒ Our methods help to characterize multivariate distributions and identify outlier candidates from massive multivariate data; the results therefore give scientists a starting point for further study with less bias.
CHP as an exploratory data analysis and data mining tool.
Concluding Remarks
Drawbacks of CHPD
• Limited to moderate dimension data.
• CHPD estimates depths inward, not outward.
• Convexity of a data set.
• No population/theoretical counterpart.
No assumption on data distribution, Non-distance based, Affine invariant, Applicable to streaming data, Detecting Outliers, Providing Multivariate Descriptive Statistics, Exploratory data analysis