
Optimal Rates For Density-Based Clustering Using DBSCAN

Alessandro Rinaldo
Department of Statistics and Data Science
Carnegie Mellon University

joint work with Daren Wang and Xinyang Lu

September 8, 2018
WHOA-PSI 2018


Inference for Clustering

Clustering is one of the oldest and most important problems in data analysis. There is a vast literature in Statistics, Machine Learning, CS, Probability, etc., and countless algorithms...

Abstract formulation: organize a set of objects into groups, so that objects in the same group are maximally similar and objects in different groups are maximally dissimilar.

From a statistical standpoint, the goals, scope and performance of the clustering task are oftentimes poorly defined. In general, statistical inference for clustering is relatively underdeveloped.

Today I would like to (1) make a case in support of density-based clustering as a statistically principled paradigm for clustering and (2) present some consistency results.


Inference for Clustering

We are interested in clustering an i.i.d. sample $X_n = (X_1, \ldots, X_n)$ from an unknown probability distribution $P$ in $\mathbb{R}^d$, with $d$ fixed(!), having a Lebesgue density $p$. We wish to be as agnostic about $P$ as possible.

[Figure: scatterplots of four example datasets. A sample of 1000 points (normal4); EngyTime, n = 4096, dimension = 2, classes = 2, main problem: Gaussian mixture; Target, n = 770, dimension = 2, classes = 6, main problem: outliers; Chainlink, n = 1000, dimension = 3, classes = 2, main problem: not linearly separable.]


The Density-Based Clustering Approach (Hartigan, 1975, 1981)

For a fixed threshold $\lambda > 0$, the $\lambda$-upper level set (high-density region) of $p$ is

$$L(\lambda) = \{x \in \mathbb{R}^d : p(x) \ge \lambda\}.$$

A $\lambda$-cluster of $P$ is a connected component of $L(\lambda)$.

Alternatively, for $\alpha \in [0, 1]$, set $\lambda_\alpha = \sup\{\lambda : P(L(\lambda)) \ge \alpha\}$ and $L(\alpha) = L(\lambda_\alpha)$. An $\alpha$-cluster of $P$ is a connected component of $L(\alpha)$, a minimal volume set of prescribed probability content (see Rinaldo et al., 2012, and Chen, 2018+).

Cluster Tree

The family $T$ of all $\lambda$-clusters is the cluster tree of $P$. $T$ satisfies the tree property: $A, B \in T$ implies that

$$A \subset B \quad \text{or} \quad B \subset A \quad \text{or} \quad A \cap B = \emptyset.$$

This hierarchy of inclusions forms a dendrogram, with height parameter $\lambda$ or $\alpha$.

For topological and measure-theoretical details, see Steinwart (2015).
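As a small illustration of λ-clusters (a sketch, not taken from the talk): for a one-dimensional density evaluated on a grid, the connected components of the upper level set are the maximal runs of consecutive grid points where the density is at least λ. The two-component Gaussian mixture, the grid, and the function name `lambda_clusters` are assumptions chosen only for illustration.

```python
import numpy as np

def lambda_clusters(grid, density, lam):
    """Return the intervals (left, right) of grid points where density >= lam.

    On a 1D grid, connected components of the upper level set are
    maximal runs of consecutive grid points above the threshold.
    """
    above = density >= lam
    clusters = []
    start = None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                          # a new run begins
        elif not flag and start is not None:
            clusters.append((grid[start], grid[i - 1]))
            start = None
    if start is not None:                      # run reaches the end of the grid
        clusters.append((grid[start], grid[-1]))
    return clusters

def normal_pdf(x, mean):
    """Density of N(mean, 1)."""
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

# Hypothetical example: equal mixture of N(-2, 1) and N(2, 1).
grid = np.linspace(-6.0, 6.0, 1201)
p = 0.5 * normal_pdf(grid, -2.0) + 0.5 * normal_pdf(grid, 2.0)

low = lambda_clusters(grid, p, 0.02)    # threshold below the valley between modes
high = lambda_clusters(grid, p, 0.15)   # threshold above the valley
```

With these choices the low threshold gives a single cluster and the high threshold gives two, each contained in the low-threshold cluster, which is the tree property in miniature.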








The Cluster Tree

[Figure: cluster tree illustration, built up over several slides. Horizontal axis from 0 to 10; the tree height is indexed by α (ticks 1, 0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0) and by λ (ticks 0, 0.04, 0.07, 0.1, 0.13, 0.17, 0.21).]

The cluster tree of $P$ is a data structure encoding all the clustering properties of $P$: high density + connectivity.

It represents the 0-th order persistent homology of $P$.

The “number of clusters” depends on the height at which the tree is cut.

In Kim et al. (2016) we look at the topology and a partial ordering over cluster trees.


Literature

Vast theoretical literature on level set estimation and some on clustering:

Support estimation: Korostelev and Tsybakov (1993), Mammen and Tsybakov (1995), Cuevas and Fraiman (1997), Biau, Cadre and Pelletier (2008), Patschkowski and Rohde (2015).

Level set estimation for fixed λ: Polonik (1995), Tsybakov (1997), Walther (1997), Scott and Nowak (2006), Cuevas, González-Manteiga and Rodríguez-Casal (2006), Singh, Scott and Nowak (2009), Rigollet and Vert (2011), Rinaldo and Wasserman (2010).

Cluster tree estimation: Koltchinskii (2000), Stuetzle and Nugent (2010). More recently: Chaudhuri, Dasgupta, Kpotufe and von Luxburg (2013), Balakrishnan et al. (2013), Steinwart (2011, 2012, 2015), Sriperumbudur and Steinwart (2012, 2017), Kim et al. (2016).

Some algorithms: DBSCAN, HDBSCAN, OPTICS, denpro, pdfcluster, DeBaCl, etc.


Back to Clustering the Sample Points...

Cluster Tree Estimator

A cluster tree estimator is a collection of subsets of $X_n$ with the tree property.

The best (oracle) estimator is the one whose clusters are the sets $A \cap X_n$, $A \in T$ (when non-empty). Of course, it is unreasonable to ask that any tree estimator do as well as the oracle estimator.

The Chaudhuri-Dasgupta Approach

Define a separation criterion that quantifies how "far apart" any two high-density regions (clusters) are.

For each $n$, construct a collection $\mathcal{A}_n$ of high-density regions of $P$ (possibly clusters) that fulfill said criterion in such a way that, as $n \to \infty$, the degree of separation is vanishing.

A cluster tree estimator $T_n$ is consistent with respect to $\mathcal{A}_n$ when the following holds, simultaneously over all $A \ne A'$ in $\mathcal{A}_n$: with probability tending to 1 as $n \to \infty$, the smallest clusters in $T_n$ containing $A \cap X_n$ and $A' \cap X_n$ (if non-empty) are disjoint.




A Blueprint for Constructing Cluster Tree Estimators

Typically, building a cluster tree estimator entails two steps:

Density estimation step (statistically hard): construct a density estimator to determine the high-density points. In this talk, we will use KDEs: for $h > 0$ and a kernel $K$, set

$$x \in \mathbb{R}^d \mapsto \hat{p}_h(x) = \frac{1}{n h^d C_d} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right),$$

where $C_d$ is a normalizing constant. Nearest-neighbor density estimators could also be used.

Connectivity step (computationally hard): cluster the high-density sample points. Using the topology of $\mathbb{R}^d$ is computationally infeasible (even if $p$ were known exactly). It is necessary to use some approximation to speed things up.
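The density estimation step can be sketched in a few lines for the spherical kernel. This is a minimal illustration under stated assumptions, not the speaker's code: the kernel is taken to be the indicator of the closed unit ball, so that the normalizing constant is the volume of the unit ball, and the names `unit_ball_volume` and `spherical_kde` as well as the simulated sample are hypothetical.

```python
import math
import numpy as np

def unit_ball_volume(d):
    """Volume v_d of the unit Euclidean ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def spherical_kde(query, sample, h):
    """Evaluate the KDE with kernel K(u) = 1{||u|| <= 1} at each query point.

    With this kernel the normalizing constant C_d equals v_d, so the estimate
    is (number of sample points within distance h) / (n * h^d * v_d).
    """
    query = np.atleast_2d(query)
    sample = np.atleast_2d(sample)
    n, d = sample.shape
    # Pairwise distances between query points and sample points.
    dists = np.linalg.norm(query[:, None, :] - sample[None, :, :], axis=2)
    counts = (dists <= h).sum(axis=1)
    return counts / (n * h ** d * unit_ball_volume(d))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # hypothetical sample in R^2
p_hat = spherical_kde(X, X, h=0.5)     # density estimate at the sample points
```

The pairwise distance matrix costs O(n^2) memory, which is fine for a sketch; a spatial index would be the usual choice for larger samples.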


The DBSCAN-CT Estimator

For $h > 0$, let $\hat{p}_h$ be the spherical kernel density estimator and let

$$X_{(1)} \le \ldots \le X_{(n)}$$

be the sample points ordered based on their $\hat{p}_h$ values.

Algorithm 1 The DBSCAN Cluster Tree Estimator
1: Input: i.i.d. sample $X_n$ and $h > 0$.
2: for $k \in \{0, \ldots, n - 1\}$ do
3:   Construct a graph $G_{h,k}$ with node set $\{X_{(n-i)} : i = 0, \ldots, k\}$ and edge set $\{(X_{(i)}, X_{(j)}) : \|X_{(i)} - X_{(j)}\| < 2h\}$.
4:   Compute $C(h, k)$, the set of connected components of $G_{h,k}$.
5: end for
6: Output: $T_n = \{C(h, k), k \in \{0, \ldots, n - 1\}\}$.

In the connectivity step (line 4), the topology of the support is approximated with that of the $2h$-neighborhood graph over $X_n$, single-linkage style.

It can be efficiently computed with a union-find algorithm.

The original DBSCAN algorithm was designed for “flat clustering” at a fixed $k$ and is slightly different.
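The algorithm above can be sketched with a union-find structure as follows. This is an illustrative implementation under assumptions, not the authors' code: points are ranked by the number of neighbors within distance h (which is proportional to the spherical-kernel KDE), ties are broken by sample order, and the function name `dbscan_cluster_tree` and the two-blob data are hypothetical.

```python
import numpy as np

def dbscan_cluster_tree(X, h):
    """Sketch of the DBSCAN cluster tree: returns a list whose k-th entry is
    the set of connected components (frozensets of sample indices) of G_{h,k}."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Spherical-kernel KDE values are proportional to the number of sample
    # points within distance h, so the counts induce the same ordering.
    counts = (dists <= h).sum(axis=1)
    order = np.argsort(-counts, kind="stable")    # highest density first

    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    active = []                                    # nodes of the current graph
    tree = []
    for idx in order:                              # k = 0, ..., n - 1
        for j in active:
            if dists[idx, j] < 2 * h:              # edge rule of the algorithm
                ri, rj = find(idx), find(j)
                if ri != rj:
                    parent[ri] = rj
        active.append(idx)
        components = {}
        for j in active:
            components.setdefault(find(j), set()).add(int(j))
        tree.append({frozenset(c) for c in components.values()})
    return tree

# Hypothetical data: two well-separated blobs in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(40, 2)),
               rng.normal(5.0, 0.3, size=(40, 2))])
tree = dbscan_cluster_tree(X, h=0.5)
```

Adding points from highest to lowest estimated density means each level only merges existing components, so the collection returned has the tree property by construction. Recomputing the component list at every level is done here for readability rather than speed.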




Linking DBSCAN-CT to $\hat{p}_h$ (also, why $2h$?)

For any $\lambda \ge 0$, set

$$\hat{D}(\lambda) = \{x : \hat{p}_h(x) \ge \lambda\} \cap X_n$$

and

$$\hat{L}(\lambda) = \bigcup_{X_j \in \hat{D}(\lambda)} B(X_j, h). \qquad (1)$$

Let $k$ and $h$ be the input to DBSCAN. Then the node set of $G_{h,k}$ is $\hat{D}(\lambda_k)$, where $\lambda_k = \frac{k}{n h^d v_d}$. Furthermore, two points $X_i$ and $X_j$ in $\hat{D}(\lambda_k)$ are in the same connected component of $\hat{L}(\lambda_k)$ if and only if they are in the same connected component of the graph $G_{h,k}$. Consequently, for any pair $A$ and $A'$ of subsets of $\mathbb{R}^d$ with $A \cap X_n \ne \emptyset$ and $A' \cap X_n \ne \emptyset$:

if $A \subset \hat{L}(\lambda_k)$ is connected, all the sample points in $A$ belong to the same connected component of $G_{h,k}$;

if $A$ and $A'$ belong to distinct connected components of $\hat{L}(\lambda_k)$, then the sample points in $A$ and the sample points in $A'$ belong to distinct connected components of $G_{h,k}$.
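The role of the radius 2h can be made explicit in one display. The sketch below assumes open balls B(x, h) and the spherical kernel normalized by v_d, the volume of the unit ball; both conventions are assumptions consistent with, but not spelled out in, the statements above, and boundary points are ignored.

```latex
% Two balls of radius h overlap exactly when their centers are
% closer than 2h, which is the edge rule of the graph:
B(X_i, h) \cap B(X_j, h) \neq \emptyset
\iff \| X_i - X_j \| < 2h .

% For the spherical kernel, thresholding the density estimate at
% \lambda_k = k / (n h^d v_d) is a neighbor-count rule:
\hat{p}_h(X_i)
  = \frac{\#\{ l : \| X_l - X_i \| \le h \}}{n h^d v_d} \ge \lambda_k
\iff \#\{ l : \| X_l - X_i \| \le h \} \ge k .
```

A chain of overlapping balls is therefore a path in the graph and vice versa, which is why the connected components of the union of balls in (1) and those of the graph group the sample points identically.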


Illustration

[Figures: four illustration slides; images not recoverable.]


Consistency of DBSCAN-CT for Arbitrary Densities

(ε, σ)-separation criterion

Let $\epsilon, \sigma > 0$. Two connected subsets $A$ and $A'$ of the support of $P$ are said to be $(\epsilon, \sigma)$-separated when

they belong to different connected components of $L(\lambda^* - \epsilon)$, where $\lambda^* = \inf_{x \in A \cup A'} p(x) > \epsilon$, and

$\min_{k \ne l} \mathrm{dist}(C_k, C_l) > \sigma$, where $C_1, \ldots, C_m$ are the connected components of $L(\lambda^* - \epsilon)$.

The parameter $\sigma$ quantifies the geometric (horizontal) distance while $\epsilon$ measures the probabilistic (vertical) distance.

In addition, we avoid sets that are too thin.

Thick sets

A subset $A$ is $h$-thick if $A_{-h} := \{x \in A : B(x, h) \subset A\} \ne \emptyset$.






(ε, σ)-consistency of DBSCAN-CT

For a constant $C = C(d)$, when the DBSCAN-CT estimator is computed with parameter

$$\frac{\sigma}{4} \ge h \ge C \left(\frac{\log n}{n \epsilon^2}\right)^{1/d},$$

the following holds with probability at least $1 - \frac{1}{n}$, uniformly over all $2h$-thick and $(\epsilon, \sigma)$-separated clusters $A$ and $A'$: $A_{-2h} \cap X_n$ and $A'_{-2h} \cap X_n$ (if non-empty) each belong to a separate cluster of $T_n$.

Paraphrasing, and ignoring log terms, the sample complexity of DBSCAN-CT for $(\epsilon, \sigma)$-consistency of $2h$-thick clusters is

$$n \ge C(d) \frac{1}{\epsilon^2 \sigma^d}.$$

Note: we may let $\epsilon, \sigma, h \to 0$ as $n \to \infty$.

Optimality

The above sample complexity is minimax rate optimal, up to a $\log n$ factor.
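To get a feel for the bandwidth condition, the following sketch computes the admissible range for h and reports when it is empty. The constant C(d) is not specified in the talk, so `C=1.0` is purely a placeholder, and the function name `bandwidth_range` and the numerical inputs are hypothetical.

```python
import math

def bandwidth_range(n, eps, sigma, d, C=1.0):
    """Return (h_min, h_max) with h_max = sigma / 4 and
    h_min = C * (log n / (n * eps^2))^(1/d), or None if the range is empty.

    C stands in for the unspecified constant C(d); C = 1.0 is a placeholder.
    """
    h_min = C * (math.log(n) / (n * eps ** 2)) ** (1.0 / d)
    h_max = sigma / 4.0
    return (h_min, h_max) if h_min <= h_max else None

# Same separation parameters, two sample sizes.
small = bandwidth_range(n=1_000, eps=0.05, sigma=0.5, d=2)
large = bandwidth_range(n=1_000_000, eps=0.05, sigma=0.5, d=2)
```

With the placeholder constant, the range is empty for the smaller sample and non-empty for the larger one, which mirrors the sample complexity statement: n has to grow like the inverse of the squared vertical separation times the d-th power of the horizontal separation before any bandwidth qualifies.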





Comments

The previous result is nearly identical to (and follows easily from the proof of) the seminal consistency result for cluster tree estimation of Chaudhuri and Dasgupta (2010), who analyzed a different, computationally more expensive algorithm.

The $(\epsilon, \sigma)$-separation criterion applies to arbitrary densities. Because of that, the vertical and horizontal separation parameters are essentially decoupled and, as a result, the sample complexity depends jointly on them, i.e. on $\frac{1}{\epsilon^2 \sigma^d}$.

Furthermore, the resulting rate is agnostic to the degree of smoothness of the underlying density...

...which raises the question:

Can faster rates be obtained with smoother densities?


Brief Recap of Nonparametric Density Estimation

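A Lebesgue density $p : \mathbb{R}^d \to \mathbb{R}$ is said to belong to the Hölder class $\Sigma(L, \alpha)$ with parameters $\alpha, L > 0$ when, $\forall x, y \in \mathbb{R}^d$ and $\forall s \in \mathbb{N}^d$ with $|s| := \sum_{i=1}^d s_i = \lfloor \alpha \rfloor$,

$$|D^s p(x) - D^s p(y)| \le L \|x - y\|^{\alpha - |s|}.$$

When $0 < \alpha \le 1$, the above reduces to a Lipschitz condition on $p$.

Variance: If $K$ is “well behaved”, then

$$P\left(\|\hat{p}_h - p_h\|_\infty \le C \sqrt{\frac{\log(1/\gamma) + \log(1/h)}{n h^d}}\right) \ge 1 - \gamma, \quad \gamma \in (0, 1),$$

where $C = C(\|p\|_\infty, d, K) > 0$ and $p_h(x) = E[\hat{p}_h(x)]$, $x \in \mathbb{R}^d$.

Bias: If $p \in \Sigma(L, \alpha)$, then, for some $C' = C'(L, \alpha) > 0$,

$$\|p_h - p\|_\infty \le C' h^\alpha.$$

If $\alpha > 1$, the kernel $K$ has to be $\alpha$-valid (see Rigollet and Vert, 2009).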
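Combining the variance and bias bounds by the triangle inequality shows where a bandwidth of order (log n / n)^{1/(2α+d)} comes from. The derivation below is a standard sketch rather than a statement from the talk; it takes γ = 1/n, treats log(1/h) as of order log n, and absorbs the constants of the two bounds into the order symbols.

```latex
% Triangle inequality, then the variance and bias bounds
% (holding with probability at least 1 - 1/n):
\|\hat{p}_h - p\|_\infty
  \le \|\hat{p}_h - p_h\|_\infty + \|p_h - p\|_\infty
  \lesssim \sqrt{\frac{\log n}{n h^d}} + h^{\alpha}.

% Balancing the two terms:
h^{\alpha} \asymp \sqrt{\frac{\log n}{n h^d}}
  \iff h^{2\alpha + d} \asymp \frac{\log n}{n}
  \iff h \asymp \left( \frac{\log n}{n} \right)^{1/(2\alpha + d)}.

% Resulting sup-norm rate:
\|\hat{p}_h - p\|_\infty
  \lesssim \left( \frac{\log n}{n} \right)^{\alpha/(2\alpha + d)}.
```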




δ-Separation Criterion

For smooth densities, one can formulate a one-parameter criterion for cluster separation. The smoothness of the density ensures that the vertical and horizontal distances cannot be decoupled.

δ-separation criterion

Two connected subsets $A$ and $A'$ of the support of $p$ are $\delta$-separated when they belong to distinct connected components of the level set $\{x \in \mathbb{R}^d : p(x) > \lambda^* - \delta\}$, where $\lambda^* := \inf_{x \in A \cup A'} p(x)$.

For continuous densities, this corresponds to the notion of merge distance by Eldridge et al. (2015).




δ-Consistency

δ-Consistency

Let $\{\delta_n\}_n \subset \mathbb{R}_+$ and $\{\gamma_n\}_n \subset (0, 1)$ be vanishing sequences.

A cluster tree estimator $T_n$ is $\delta_n$-consistent if, with probability no smaller than $1 - \gamma_n$, for any pair of connected subsets $A$ and $A'$ of the support of $p$ that are $\delta_n$-separated, the two smallest clusters in $T_n$ containing $A \cap X_n$ and $A' \cap X_n$ (if non-empty) are distinct.

The sequence $\delta_n$ then defines a consistency rate.

Interestingly, we need to distinguish two cases: $\alpha \le 1$ and $\alpha > 1$.




δ-Consistency: α ≤ 1

The DBSCAN-CT estimator with parameter $h$ of order $\left(\frac{\log n}{n}\right)^{1/(2\alpha + d)}$ is $\delta_n$-consistent with rate

$$\delta_n \ge C \left(\frac{\log n}{n}\right)^{\alpha/(2\alpha + d)} \quad \text{and} \quad \gamma_n = \frac{1}{n},$$

for a constant $C = C(\|p\|_\infty, L, d)$.

The above rate is minimax optimal, up to a $\log n$ factor.

This is not surprising: Sriperumbudur and Steinwart (2012) have an analogous result about DBSCAN in different settings.
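The effect of the dimension on this rate is easy to tabulate. The sketch below evaluates the bandwidth and the separation rate for a given sample size; the unspecified constants are replaced by the placeholder `C=1.0`, and the function name `dbscan_ct_rate` and the chosen values of n, α and d are hypothetical.

```python
import math

def dbscan_ct_rate(n, alpha, d, C=1.0):
    """Return (h, delta_n, gamma_n) for the choices stated above, with the
    unspecified constants replaced by the placeholder C."""
    base = math.log(n) / n
    h = C * base ** (1.0 / (2 * alpha + d))
    delta_n = C * base ** (alpha / (2 * alpha + d))
    gamma_n = 1.0 / n
    return h, delta_n, gamma_n

for d in (1, 2, 10):
    h, delta_n, gamma_n = dbscan_ct_rate(n=10_000, alpha=1.0, d=d)
    print(f"d={d:2d}  h={h:.3f}  delta_n={delta_n:.3f}  gamma_n={gamma_n:.0e}")
```

For a fixed sample size the resolvable separation grows with d, which is the usual nonparametric curse of dimensionality appearing in the clustering rate.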


δ-Consistency: α > 1

If $p \in \Sigma(L, \alpha)$ with $\alpha > 1$, the DBSCAN-CT estimator is no longer rate optimal, for two kinds of reasons:

Statistical reasons: by standard non-parametric theory, in order to minimize the bias of $\hat{p}_h$ we can no longer use a spherical kernel but instead need to deploy a smoother kernel (e.g., an $\alpha$-valid kernel). Easy to fix.

Computational reasons: when using a single-linkage type of method for the connectivity step, the error we incur in approximating the level sets of $p$ is of order $h$, larger than the order of the bias, $h^\alpha$. So, when $\alpha > 1$, DBSCAN-CT would not be optimal even if we could use the true density to get the high-density points! Hard to fix.

Theoretically trivial but computationally infeasible consistent estimator: use an $\alpha$-valid kernel to get a rate-optimal KDE $\hat{p}_h$, and compute the connected components of the upper level set of $\hat{p}_h$ to cluster the data.

Instead we will derive conditions on $p$ ensuring that single-linkage still works!


Assumptions for α > 1

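There exists a $\delta_0 > 0$ such that, for any split level $\lambda^*$ of $p$ and any $0 < \delta \le \delta_0$, the set $\Omega_\delta = \{x : p(x) \ge \lambda^* + \delta\}$ satisfies the following.

Standard Assumption: for any $r \ge 0$ and $x \in \Omega_\delta$,

$$\mathrm{Vol}(B(x, r) \cap \Omega_\delta) \gtrsim v_d r^d.$$

The Covering Condition: for any $r > 0$, there exists a collection of points $N_r \subset \Omega_\delta$ such that $\mathrm{card}(N_r) \lesssim r^{-d}$ and

$$\bigcup_{y \in N_r} B(y, r) \supset \Omega_\delta.$$

Low Noise Condition: letting $\{C_k\}_{k=1}^m$ be the connected components of $\{x : p(x) > \lambda^*\}$,

$$\min_{k \ne k'} \mathrm{dist}\left(C_k \cap \{p \ge \lambda^* + \delta\},\; C_{k'} \cap \{p \ge \lambda^* + \delta\}\right) \gtrsim \delta^{1/\alpha}.$$

There exist densities satisfying the above conditions, for any $\alpha > 1$. For example, natural splines ($\alpha = 3$) and Morse functions ($\alpha = 2$).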





Consider the DBSCAN-CT($\alpha$) estimator: same as the DBSCAN-CT estimator, except that a KDE with an $\alpha$-valid kernel ($\alpha > 1$) is used to rank the data. It can similarly be computed with a union-find algorithm.

Let $p \in \Sigma(L, \alpha)$, with $\alpha > 1$, be any density function with compact connected support and finitely many split levels bounded from below by $\lambda_0 > 0$ and satisfying the above conditions. Then, the DBSCAN-CT($\alpha$) estimator with parameter $h$ of order $\left(\frac{\log n}{n}\right)^{1/(2\alpha + d)}$ is $\delta_n$-consistent with rate $\delta_n$ of order

$$\left(\frac{\log n}{n}\right)^{\alpha/(2\alpha + d)},$$

and $\gamma_n$ of order $\frac{1}{n} + h^{-d} \exp(-\lambda_0 n h^d)$. The constants depend on $d$, $L$, $\|p\|_\infty$ and $\alpha$.

The above rate is minimax optimal, up to $\log n$ factors.





Summary

We have shown that DBSCAN-based procedures deliver, under some conditions, optimal consistency rates for cluster tree estimation with faster rates arising from smoother densities, while remaining computationally feasible.

Interestingly, the rates we obtain match the rates for minimax estimation of densities in $\Sigma(L, \alpha)$ in the $\|\cdot\|_\infty$ norm. Though in hindsight this may not be that surprising, this result further confirms that the $\|\cdot\|_\infty$ norm is a good (possibly the right) metric for cluster tree consistency. See also Kim et al. (2016).

It is surprising (to us) however that one can in principle perform optimal clustering using computationally feasible methods for the connectivity step.


Future work

For instance...

Choice of the tuning parameter h.

Adaptivity to α.

Extension to algorithms based on knn-graphs and knn-density estimators.

Extension to non-Euclidean data.

Inference beyond consistency: confidence sets for the cluster tree.

Study single-linkage type procedures to estimate higher order topological properties of the densities. In an upcoming work we will show how to draw statistical inference for the persistent homology of high-density sets using the Rips complex, further advancing a line of work initiated by Bobrowski, Mukherjee and Taylor (2014).

Investigate if and how density-based clustering can work in high dimensions.


Inference for Cluster Trees: Beyond Consistency

How do we carry out statistical inference for a discrete object such as a cluster tree?

In Kim et al. (2016) we illustrate the difficulties in building confidence sets for density trees. We also propose bootstrap-based pruning methods to “simplify” the tree and eliminate spurious clusters.

[Figure: scatterplots of the Ring data (n = 1200), the Mickey mouse data (n = 1200) and the Yingyang data (n = 3200), each shown with its cluster tree (vertical axis lambda) at alpha = 0.05.]

It is worth noting that this is a strongly non-parametric problem for which only one-sided inference (Donoho, 1988) is feasible: it is not possible to learn the entire tree, but we may be able to discern parts of it.


Density Clustering in High-Dimensions

Our results are established within the classic nonparametric framework, leading to dimension-dependent rates! This would suggest that density clustering cannot/should not be done in high dimensions.

In fact, empirically that is not the case (see next slide).

Dimension-dependent rates are a bias issue, which calls for a vanishing $h$. But if we keep $h$ fixed and we target $p_h$ instead of $p$, then the dimension no longer affects the rate. (It is still in the constants!)

Intuition: in high dimensions, density clustering may still work provided that good clustering solutions come from heavily biased density estimators. This paradigm requires a different analysis.



High-Dimensional Density-Based Clustering: Population Admixture

Data from the Human Genome Diversity Project (HGDP) dataset, available at

http://www.hagsc.org/hgdp/files.html.

Cleaned-up dataset comprising 11,775 SNPs from 931 subjects from 53 populations, from Crosset et al. (2010).

The goal of the analysis is to identify the hierarchy of high-density clusters of individuals in the sample, ideally capturing the correct membership in populations.

We use DeBaCl, a density clustering algorithm that uses knn-graphs. In the first level set tree below, k = 40; in the second, k = 6.




Figure 3.6: Oversmoothed level set tree result for population genetics. (a) When the smoothing parameter k is set to 40, the estimated level set tree for the HGDP population genetic data has only four all-mode clusters. (b) The map shows these clusters describe continent groups well, but do not capture more detailed population affiliations. Each pie chart on the map represents a true population, and the slices of each pie represent the contribution from each cluster (matched by color to the dendrogram). The clusters in this result are very well matched to populations.

Figure 3.7: Exploring high-density population genetics clusters. The level set tree is constructed with k = 6, yielding a more detailed and multi-scale set of high-density clusters. (a) Cutting the tree at a low α level yields continent groups and highly dissimilar populations (Figure 3.8a). (b) Cutting at a higher α level produces high-density clusters that better capture individual populations (Figure 3.8b).

High-Dimensional Density-Based Clustering: Population Admixture

Figure 3.8: Maps indicating clusters for the dendrograms in Figure 3.7. Each pie chart on the map represents a true population, and the slices of each pie represent the contribution from each cluster (matched by color to the respective dendrogram in Figure 3.7). (a) High-density clusters for a low α cut of the tree (Figure 3.7a). Populations in Papua New Guinea and the Americas are identified, but only continent groups are recovered in Africa, Eurasia, and East Asia. (b) High-density clusters for the high α cut (Figure 3.7b) correspond better to individual populations.






Density-Based Clustering of Functional Data

Density-based clustering algorithms can be used in general metric spaces and in particular with functional data.

The phoneme dataset, from Ferraty and Vieu (2006), contains log-periodograms of 2000 instances of digitized human speech, divided evenly between five phonemes: “sh”, “dcl” (as in “dark”), “iy” (as in the vowel of “she”), “aa”, and “ao”. Each recording is treated as a single functional observation, which was smoothed using a cubic spline.

The distance between functions is the L2 distance (each phoneme is observed over 150 frequencies).

We use DeBaCl with k = 20.



Figure 4.1: Spoken phoneme data, separated into the 5 true classes. Each phoneme is represented by the log-periodogram at 150 fixed frequencies (see Hastie et al. (1995) for details), which we smoothed with a cubic spline. The phonemes are (a) ‘sh’, (b) ‘iy’, (c) ‘dcl’, (d) ‘aa’, and (e) ‘ao’. (f) The mean function for each phoneme; color indicates correspondence between phoneme class and mean function.

[...] what contrived to use to a pseudo-density estimate instead of a bona fide density. Furthermore, the phoneme dataset has been curated carefully by Hastie et al. (1995) and Ferraty and Vieu (2006) for the purpose of illustrating statistical methods. Hurricane trajectories are also easily visualized functional data, but they are



Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indicate higher pseudo-density values, implying greater proximity to neighbors.

Figure 4.3: (a) Phoneme level set tree. (b) All-mode clusters. The modal observation in each all-mode cluster is shown with a black outline.

sampled irregularly (in space), have variable lengths, and lie in an ambient dimension larger than d = 1, making them much more like white matter fiber streamlines. We show that level set trees can be used effectively to describe the topography of a hurricane track dataset and to gather tracks into coherent clusters, opening a new [...]




