Optimal Rates For Density-Based Clustering Using DBSCAN

Alessandro Rinaldo
Department of Statistics and Data Science
Carnegie Mellon University

joint work with Daren Wang and Xinyang Lu

September 8, 2018
WHOA-PSI 2018

Inference for Clustering

Clustering is one of the oldest and most important problems in data analysis. There is a vast literature in Statistics, Machine Learning, CS, Probability, etc., and countless algorithms...

Abstract formulation: organize a set of objects into groups, so that objects in the same group are maximally similar and objects in different groups are maximally dissimilar.

From a statistical standpoint, the goals, scope and performance of the clustering task are oftentimes poorly defined. In general, statistical inference for clustering is relatively underdeveloped.

Today I would like to (1) make a case in support of density-based clustering as a statistically principled paradigm for clustering and (2) present some consistency results.

We are interested in clustering an i.i.d. sample $X_n = (X_1, \ldots, X_n)$ from an unknown probability distribution $P$ on $\mathbb{R}^d$, with $d$ fixed(!), having a Lebesgue density $p$. We wish to be as agnostic about $P$ as possible.

[Figure: four example data sets. Panel 1: scatter plot of a two-dimensional sample (axes normal4[1:1000, ][,1] and normal4[1:1000, ][,2]). Panel 2: EngyTime, n = 4096, dimension = 2, classes = 2, main problem: gaussian mixture (axes ENGY and TIME). Panel 3: Target, n = 770, dimension = 2, classes = 6, main problem: outlier (axes x and y). Panel 4: Chainlink, n = 1000, dimension = 3, classes = 2, main problem: linear not separable (axes x, y and z).]

The Density-Based Clustering Approach (Hartigan, 1975, 1981)

For a fixed threshold $\lambda > 0$, the $\lambda$-upper level set (high-density region) of $p$ is

$$L(\lambda) = \{x \in \mathbb{R}^d : p(x) \geq \lambda\}.$$

A $\lambda$-cluster of $P$ is a connected component of $L(\lambda)$.

Alternatively, for $\alpha \in [0, 1]$, set $\lambda_\alpha = \sup\{\lambda : P(L(\lambda)) \geq \alpha\}$ and $L(\alpha) = L(\lambda_\alpha)$. An $\alpha$-cluster of $P$ is a connected component of $L(\alpha)$, the minimal-volume set of prescribed probability content (see Rinaldo et al., 2012, and Chen, 2018+).

Cluster Tree

The family $T$ of all $\lambda$-clusters is the cluster tree of $P$. $T$ satisfies the tree property: $A, B \in T$ implies that

$$A \subset B \quad\text{or}\quad B \subset A \quad\text{or}\quad A \cap B = \emptyset.$$

This hierarchy of inclusions forms a dendrogram, with height parameter $\lambda$ or $\alpha$.

For topological and measure-theoretical details, see Steinwart (2015).
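
As a small illustration of these definitions, the sketch below computes the connected components of an upper level set for a one-dimensional density evaluated on a grid. The two-component Gaussian mixture, the grid, and the helper names `level_set_clusters` and `gaussian_pdf` are assumptions chosen for the example; they do not come from the talk.

```python
import numpy as np

def level_set_clusters(grid, density, lam):
    """Connected components of {x : p(x) >= lam} on a 1-D grid,
    returned as a list of (left, right) interval endpoints."""
    above = density >= lam
    clusters, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                       # a new component begins
        elif not flag and start is not None:
            clusters.append((grid[start], grid[i - 1]))
            start = None                    # the component has ended
    if start is not None:
        clusters.append((grid[start], grid[-1]))
    return clusters

def gaussian_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Hypothetical density: equal mixture of N(-2, 1) and N(2, 1).
grid = np.linspace(-6.0, 6.0, 2401)
p = 0.5 * gaussian_pdf(grid, -2.0, 1.0) + 0.5 * gaussian_pdf(grid, 2.0, 1.0)

low = level_set_clusters(grid, p, 0.02)   # level below the saddle at x = 0
high = level_set_clusters(grid, p, 0.15)  # level above the saddle
print(len(low), len(high))
```

At the lower level the level set is a single interval, while at the higher level it splits into two intervals, each contained in the lower-level cluster, which is the tree property in its simplest form.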

The Cluster Tree

[Figure: a density and its cluster tree, built up over several slides; horizontal axis with ticks 0, 5, 10, and the tree height labeled λ.]

The cluster tree of $P$ is a data structure encoding all the clustering properties of $P$: high density + connectivity.

It represents the 0-th order persistent homology of $P$.

The "number of clusters" depends on the height of the tree.

In Kim et al. (2016) we look at the topology and a partial ordering over cluster trees.

Literature

Vast theoretical literature on level set estimation and some on clustering:

Support estimation: Korostelev and Tsybakov (1993), Mammen and Tsybakov (1995), Cuevas and Fraiman (1997), Biau, Cadre and Pelletier (2008), Patschkowski and Rohde (2015).

Level set estimation for fixed $\lambda$: Polonik (1995), Tsybakov (1997), Walther (1997), Scott and Nowak (2006), Cuevas, González-Manteiga and Rodríguez-Casal (2006), Singh, Scott and Nowak (2009), Rigollet and Vert (2011), Rinaldo and Wasserman (2010).

Cluster tree estimation: Koltchinskii (2000), Stuetzle and Nugent (2010). More recently: Chaudhuri, Dasgupta, Kpotufe and von Luxburg (2013), Balakrishnan et al. (2013), Steinwart (2011, 2012, 2015), Sriperumbudur and Steinwart (2012, 2017), Kim et al. (2016).

Some algorithms: DBSCAN, HDBSCAN, OPTICS, denpro, pdfcluster, DeBaCl, etc.

Back to Clustering the Sample Points...

Cluster Tree Estimator

A cluster tree estimator is a collection of subsets of $X_n$ with the tree property.

The best (oracle) estimator is the one whose clusters are the sets $A \cap X_n$, $A \in T$ (when non-empty). Of course, it is unreasonable to ask that any tree estimator do as well as the oracle estimator.

The Chaudhuri-Dasgupta Approach

Define a separation criterion that quantifies how "far apart" any two high-density regions (clusters) are.

For each $n$, construct a collection $\mathcal{A}_n$ of high-density regions of $P$ (possibly clusters) that fulfill said criterion in such a way that, as $n \to \infty$, the degree of separation vanishes.

A cluster tree estimator $T_n$ is consistent with respect to $\mathcal{A}_n$ when the following holds simultaneously over all $A \neq A'$ in $\mathcal{A}_n$: with probability tending to 1 as $n \to \infty$, the smallest clusters in $T_n$ containing $A \cap X_n$ and $A' \cap X_n$ (if non-empty) are disjoint.
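
The event appearing in this definition can be checked mechanically once a tree estimate is in hand. The sketch below is one possible encoding, assuming the estimated tree is stored as a list of frozensets of sample indices; the toy tree and the function names `smallest_cluster_containing` and `separates` are hypothetical.

```python
def smallest_cluster_containing(tree, points):
    """Smallest cluster in `tree` that contains every index in `points`,
    or None if no cluster contains them all."""
    candidates = [cluster for cluster in tree if points <= cluster]
    return min(candidates, key=len) if candidates else None

def separates(tree, a_points, b_points):
    """True when the smallest clusters containing the two point sets are disjoint."""
    ca = smallest_cluster_containing(tree, a_points)
    cb = smallest_cluster_containing(tree, b_points)
    return ca is not None and cb is not None and ca.isdisjoint(cb)

# Hypothetical tree on six sample indices; it satisfies the tree property.
tree = [
    frozenset(range(6)),
    frozenset({0, 1, 2}),
    frozenset({3, 4, 5}),
    frozenset({0, 1}),
]

print(separates(tree, frozenset({0, 1}), frozenset({4, 5})))  # True
print(separates(tree, frozenset({1, 2}), frozenset({2, 3})))  # False
```

In the second call the smallest cluster containing indices 2 and 3 is the root, so the two point sets are not separated by the tree.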

A Blueprint for Constructing Cluster Tree Estimators

Typically, building a cluster tree estimator entails two steps:

Density estimation step (statistically hard): construct a density estimator to determine the high-density points. In this talk, we will use KDEs: for $h > 0$ and a kernel $K$, set

$$x \in \mathbb{R}^d \mapsto \hat{p}_h(x) = \frac{1}{n h^d C_d} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right),$$

where $C_d$ is a normalizing constant. Nearest-neighbor density estimators could also be used.

Connectivity step (computationally hard): cluster the high-density sample points. Using the topology of $\mathbb{R}^d$ is computationally infeasible (even if $p$ were known exactly). It is necessary to use some approximation to speed things up.
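
For the density estimation step, a minimal sketch of the KDE formula is given below for one specific choice: the spherical kernel $K(u) = 1\{\|u\| \leq 1\}$ with $C_d$ equal to the volume of the unit ball. That choice, the closed-ball convention, and the function names are assumptions made for the example.

```python
import math
import numpy as np

def unit_ball_volume(d):
    """Lebesgue volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def spherical_kde(query, sample, h):
    """Spherical-kernel KDE: number of sample points within distance h of each
    query point, divided by n times the volume of a ball of radius h."""
    query = np.atleast_2d(query)
    n, d = sample.shape
    # Pairwise Euclidean distances, shape (number of queries, n).
    dists = np.linalg.norm(query[:, None, :] - sample[None, :, :], axis=2)
    counts = (dists <= h).sum(axis=1)
    return counts / (n * h ** d * unit_ball_volume(d))

# Hypothetical toy sample in the plane.
sample = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(spherical_kde(np.array([[0.0, 0.0], [5.0, 5.0]]), sample, h=0.5))
```

With this kernel, ranking the sample points by their KDE values is the same as ranking them by the number of sample points within distance $h$.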

The DBSCAN-CT Estimator

For $h > 0$, let $\hat{p}_h$ be the spherical kernel density estimator and let $X_{(1)}, \ldots, X_{(n)}$ be the sample points ordered by increasing $\hat{p}_h$ values, so that $\hat{p}_h(X_{(1)}) \leq \ldots \leq \hat{p}_h(X_{(n)})$.

Algorithm 1 The DBSCAN Cluster Tree Estimator
1: Input: i.i.d. sample $X_n$ and $h > 0$.
2: for $k \in \{0, \ldots, n-1\}$ do
3:   Construct a graph $G_{h,k}$ with node set $\{X_{(n-i)} : i = 0, \ldots, k\}$ and edge set $\{(X_{(i)}, X_{(j)}) : \|X_{(i)} - X_{(j)}\| < 2h\}$.
4:   Compute $C(h, k)$, the set of connected components of $G_{h,k}$.
5: end for
6: Output: $T_n = \{C(h, k) : k \in \{0, \ldots, n-1\}\}$.

In the connectivity step (line 4), the topology of the support is approximated with that of the $2h$-neighborhood graph over $X_n$, single-linkage style.

It can be efficiently computed with a union-find algorithm.

The original DBSCAN algorithm was designed for "flat clustering" at a fixed $k$ and is slightly different.
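
The algorithm above can be sketched with a union-find structure as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: points are added from highest to lowest estimated density, ties are broken by sample order, and the full distance matrix is formed, which is only reasonable for small samples. The function name `dbscan_cluster_tree` is hypothetical.

```python
import numpy as np

def dbscan_cluster_tree(sample, h):
    """Return a list whose k-th entry holds the connected components
    (as frozensets of sample indices) of the graph on the k + 1
    highest-density points, with edges between points closer than 2h."""
    n = sample.shape[0]
    dists = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
    # Spherical-kernel KDE up to a constant factor: neighbour counts within h.
    density = (dists <= h).sum(axis=1)
    order = np.argsort(-density, kind="stable")  # highest density first

    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    tree, active = [], []
    for idx in order:
        idx = int(idx)
        for j in active:
            if dists[idx, j] < 2 * h:
                parent[find(idx)] = find(j)  # merge the two components
        active.append(idx)
        components = {}
        for j in active:
            components.setdefault(find(j), set()).add(j)
        tree.append([frozenset(c) for c in components.values()])
    return tree

# Hypothetical toy sample: two well-separated groups on the line.
sample = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
tree = dbscan_cluster_tree(sample, h=0.3)
print(tree[-1])
```

Because points are only ever added and components only ever merge, the clusters recorded across levels are nested or disjoint, so the output has the tree property.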

Linking DBSCAN-CT to $\hat{p}_h$ (also, why $2h$?)

For any $\lambda \geq 0$, set

$$\hat{D}(\lambda) = \{x : \hat{p}_h(x) \geq \lambda\} \cap X_n$$

and

$$\hat{L}(\lambda) = \bigcup_{X_j \in \hat{D}(\lambda)} B(X_j, h). \qquad (1)$$

Let $k$ and $h$ be the input to DBSCAN. Then the node set of $G_{h,k}$ is $\hat{D}(\lambda_k)$, where

$$\lambda_k = \frac{k}{n h^d v_d}.$$

Furthermore, two points $X_i$ and $X_j$ in $\hat{D}(\lambda_k)$ are in the same connected component of $\hat{L}(\lambda_k)$ if and only if they are in the same graphical connected component of $G_{h,k}$. Consequently, for any pair $A$ and $A'$ of subsets of $\mathbb{R}^d$ with $A \cap X_n \neq \emptyset$ and $A' \cap X_n \neq \emptyset$:

if $A \subset \hat{L}(\lambda_k)$ is connected, all the sample points in $A$ belong to the same connected component of $G_{h,k}$;

if $A$ and $A'$ belong to distinct connected components of $\hat{L}(\lambda_k)$, then the sample points in $A$ and the sample points in $A'$ belong to distinct connected components of $G_{h,k}$.
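
Both the threshold $\lambda_k$ and the radius $2h$ can be motivated by a short computation. The lines below assume the spherical kernel $K(u) = 1\{\|u\| \leq 1\}$, that $v_d$ denotes the volume of the unit ball in $\mathbb{R}^d$, and that the balls $B(\cdot, h)$ in (1) are open; these conventions are assumptions, as the slide does not spell them out.

$$\hat{p}_h(X_i) = \frac{|\{j : \|X_j - X_i\| \leq h\}|}{n h^d v_d} \geq \frac{k}{n h^d v_d} = \lambda_k \iff |\{j : \|X_j - X_i\| \leq h\}| \geq k,$$

$$B(X_i, h) \cap B(X_j, h) \neq \emptyset \iff \|X_i - X_j\| < 2h.$$

So thresholding the KDE at $\lambda_k$ keeps exactly the sample points with at least $k$ sample points within distance $h$, and two retained points have overlapping $h$-balls exactly when they are joined by an edge of the $2h$-neighborhood graph. Chaining overlapping balls then matches paths in the graph.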

Consistency of DBSCAN-CT for Arbitrary Densities

(ε, σ)-separation criterion

Let $\varepsilon, \sigma > 0$. Two connected subsets $A$ and $A'$ of the support of $P$ are said to be $(\varepsilon, \sigma)$-separated when

they belong to different connected components of $L(\lambda^* - \varepsilon)$, where $\lambda^* = \inf_{x \in A \cup A'} p(x) > \varepsilon$, and

$\min_{k \neq l} \mathrm{dist}(C_k, C_l) > \sigma$, where $C_1, \ldots, C_m$ are the connected components of $L(\lambda^* - \varepsilon)$.

The parameter $\sigma$ quantifies the geometric (horizontal) distance, while $\varepsilon$ measures the probabilistic (vertical) distance.

In addition, we avoid sets that are too thin.

Thick sets

A subset $A$ is $h$-thick if $A_{-h} := \{x \in A : B(x, h) \subset A\} \neq \emptyset$.

Consistency of DBSCAN-CT for Arbitrary Densities

(ε, σ)-consistency of DBSCAN-CT

For a constant $C = C(d)$, when the DBSCAN-CT estimator is computed with parameter

$$\frac{\sigma}{4} \geq h \geq C \left(\frac{\log n}{n \varepsilon^2}\right)^{1/d},$$

the following holds with probability at least $1 - \frac{1}{n}$, uniformly over all $2h$-thick and $(\varepsilon, \sigma)$-separated clusters $A$ and $A'$: $A_{-2h} \cap X_n$ and $A'_{-2h} \cap X_n$ (if non-empty) each belong to a separate cluster of $T_n$.

Paraphrasing, and ignoring log terms, the sample complexity of DBSCAN-CT for $(\varepsilon, \sigma)$-consistency of $2h$-thick clusters is

$$n \geq C(d) \frac{1}{\varepsilon^2 \sigma^d}.$$

Note: we may let $\varepsilon, \sigma, h \to 0$ as $n \to \infty$.

Optimality

The above sample complexity is minimax rate optimal, up to a $\log n$ factor.
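
The sample complexity can be read off from the requirement that the admissible window for $h$ be non-empty. A short derivation, in which the constant $(4C)^d$ is absorbed into a generic $C(d)$:

$$\frac{\sigma}{4} \geq C \left(\frac{\log n}{n \varepsilon^2}\right)^{1/d} \iff \left(\frac{\sigma}{4C}\right)^d \geq \frac{\log n}{n \varepsilon^2} \iff \frac{n}{\log n} \geq \frac{(4C)^d}{\varepsilon^2 \sigma^d}.$$

Dropping the $\log n$ term gives a condition of the form $n \geq C(d) / (\varepsilon^2 \sigma^d)$.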

Comments

The previous result is nearly identical to (and follows easily from the proof of) the seminal consistency result for cluster tree estimation of Chaudhuri and Dasgupta (2010), who analyzed a different, computationally more expensive algorithm.

The $(\varepsilon, \sigma)$-separation criterion applies to arbitrary densities. Because of that, the vertical and horizontal separation parameters are essentially decoupled and, as a result, the sample complexity depends jointly on them, i.e. on $\frac{1}{\varepsilon^2 \sigma^d}$.

Furthermore, the resulting rate is agnostic to the degree of smoothness of the underlying density...

...which raises the question:

Can faster rates be obtained with smoother densities?

Brief Recap of Nonparametric Density Estimation

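A Lebesgue density $p : \mathbb{R}^d \to \mathbb{R}$ is said to belong to the Hölder class $\Sigma(L, \alpha)$ with parameters $\alpha, L > 0$ when, $\forall x, y \in \mathbb{R}^d$ and $\forall s \in \mathbb{N}^d$ with $|s| := \sum_{i=1}^d s_i = \lfloor \alpha \rfloor$,

$$|D^s p(x) - D^s p(y)| \leq L \|x - y\|^{\alpha - |s|}.$$

When $0 < \alpha \leq 1$, the above reduces to a Lipschitz condition on $p$.

Variance: If $K$ is "well behaved", then

$$\mathbb{P}\left(\|\hat{p}_h - p_h\|_\infty \leq C \sqrt{\frac{\log(1/\gamma) + \log(1/h)}{n h^d}}\right) \geq 1 - \gamma, \quad \gamma \in (0, 1),$$

where $C = C(\|p\|_\infty, d, K) > 0$ and $p_h(x) = E[\hat{p}_h(x)]$, $x \in \mathbb{R}^d$.

Bias: If $p \in \Sigma(L, \alpha)$, then, for some $C' = C'(L, \alpha) > 0$,

$$\|p_h - p\|_\infty \leq C' h^\alpha.$$

If $\alpha > 1$, the kernel $K$ has to be $\alpha$-valid (see Rigollet and Vert, 2009).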
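
The two bounds combine through the triangle inequality, and balancing them suggests a bandwidth. The derivation below takes $\gamma = 1/n$ and treats $\log(1/h)$ as being of order $\log n$, which holds for the polynomial-in-$n$ bandwidth chosen at the end; both choices are assumptions made to keep the computation short.

$$\|\hat{p}_h - p\|_\infty \leq \|\hat{p}_h - p_h\|_\infty + \|p_h - p\|_\infty \lesssim \sqrt{\frac{\log n}{n h^d}} + h^\alpha.$$

Setting $h^\alpha = \sqrt{\log n / (n h^d)}$ gives $h^{2\alpha + d} = \log n / n$, that is,

$$h = \left(\frac{\log n}{n}\right)^{1/(2\alpha + d)}, \qquad \|\hat{p}_h - p\|_\infty \lesssim h^\alpha = \left(\frac{\log n}{n}\right)^{\alpha/(2\alpha + d)}.$$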

δ-Separation Criterion

For smooth densities, one can formulate a one-parameter criterion for cluster separation. The smoothness of the density ensures that the vertical and horizontal distances cannot be decoupled.

δ-separation criterion

Two connected subsets $A$ and $A'$ of the support of $p$ are $\delta$-separated when they belong to distinct connected components of the level set $\{x \in \mathbb{R}^d : p(x) > \lambda^* - \delta\}$, where $\lambda^* := \inf_{x \in A \cup A'} p(x)$.

For continuous densities, this corresponds to the notion of merge distance by Eldridge et al. (2015).
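
In one dimension the criterion is easy to test numerically, because two points lie in distinct components of $\{p > \lambda^* - \delta\}$ exactly when the density dips to $\lambda^* - \delta$ or below somewhere between them. The sketch below makes that check for two singleton sets on a grid; the mixture density, the grid, and the function name `delta_separated_1d` are assumptions for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def delta_separated_1d(grid, density, a, b, delta):
    """Check delta-separation of the singletons {a} and {b} for a density
    tabulated on a 1-D grid (a and b are snapped to the nearest grid point)."""
    i, j = sorted((int(np.argmin(np.abs(grid - a))),
                   int(np.argmin(np.abs(grid - b)))))
    lam_star = min(density[i], density[j])   # infimum of p over the two points
    valley = density[i:j + 1].min()          # lowest density between them
    return bool(valley <= lam_star - delta)

# Hypothetical density: equal mixture of N(-2, 1) and N(2, 1).
grid = np.linspace(-6.0, 6.0, 2401)
p = 0.5 * gaussian_pdf(grid, -2.0, 1.0) + 0.5 * gaussian_pdf(grid, 2.0, 1.0)

print(delta_separated_1d(grid, p, -2.0, 2.0, delta=0.1))
print(delta_separated_1d(grid, p, -2.0, 2.0, delta=0.2))
```

The two modes sit at height about 0.20 and the saddle at about 0.05, so the modes are separated for $\delta = 0.1$ but not for $\delta = 0.2$.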

δ-Consistency

Let $\{\delta_n\}_n \subset \mathbb{R}_+$ and $\{\gamma_n\}_n \subset (0, 1)$ be vanishing sequences.

A cluster tree estimator $T_n$ is $\delta_n$-consistent if, with probability no smaller than $1 - \gamma_n$, for any pair of connected subsets $A$ and $A'$ of the support of $p$ that are $\delta_n$-separated, the two smallest clusters in $T_n$ containing $A \cap X_n$ and $A' \cap X_n$ (if non-empty) are distinct.

The sequence $\delta_n$ then defines a consistency rate.

Interestingly, we need to distinguish two cases: $\alpha \leq 1$ and $\alpha > 1$.

δ-Consistency: α ≤ 1

The DBSCAN-CT estimator with parameter $h$ of order $\left(\frac{\log n}{n}\right)^{1/(2\alpha+d)}$ is $\delta_n$-consistent with rate

$$\delta_n \geq C \left(\frac{\log n}{n}\right)^{\alpha/(2\alpha+d)} \quad\text{and}\quad \gamma_n = \frac{1}{n},$$

for a constant $C = C(\|p\|_\infty, L, d)$.

The above rate is minimax optimal, up to a $\log n$ factor.

This is not surprising: Sriperumbudur and Steinwart (2012) have an analogous result about DBSCAN in different settings.
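
To get a feel for these orders of magnitude, the helper below evaluates the bandwidth and the rate with all constants set to 1. The unit constants and the name `bandwidth_and_rate` are assumptions; the theorem only fixes the dependence on $n$, $\alpha$ and $d$.

```python
import math

def bandwidth_and_rate(n, alpha, d):
    """Bandwidth h and separation rate delta, with all constants set to 1."""
    base = math.log(n) / n
    h = base ** (1.0 / (2 * alpha + d))
    delta = base ** (alpha / (2 * alpha + d))
    return h, delta

for n in (1_000, 10_000, 100_000):
    h, delta = bandwidth_and_rate(n, alpha=1.0, d=2)
    print(n, round(h, 4), round(delta, 4))
```

By construction $\delta = h^\alpha$, so for a fixed $h$ below 1 a larger $\alpha$ corresponds to a smaller resolvable separation.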

δ-Consistency: α > 1

If $p \in \Sigma(L, \alpha)$ with $\alpha > 1$, the DBSCAN-CT estimator is no longer rate optimal, for two types of reasons:

Statistical reasons: by standard nonparametric theory, in order to minimize the bias of $\hat{p}_h$ we can no longer use a spherical kernel but instead need to deploy a smoother kernel (e.g., an $\alpha$-valid kernel). Easy to fix.

Computational reasons: when using a single-linkage type of method for the connectivity step, the error we incur in approximating the level sets of $p$ is of order $h$, larger than the order of the bias, $h^\alpha$. So, when $\alpha > 1$, DBSCAN-CT would not be optimal even if we could use the true density to get the high-density points! Hard to fix.

Theoretically trivial but computationally infeasible consistent estimator: use an $\alpha$-valid kernel to get a rate-optimal KDE $\hat{p}_h$, and compute the connected components of the upper level set of $\hat{p}_h$ to cluster the data.

Instead we will derive conditions on $p$ ensuring that single-linkage still works!

Assumptions for α > 1

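There exists a $\delta_0 > 0$ such that, for any split level $\lambda^*$ of $p$ and any $0 < \delta \leq \delta_0$, the set $\Omega_\delta = \{x : p(x) \geq \lambda^* + \delta\}$ satisfies the following.

Standard Assumption: for any $r \geq 0$ and $x \in \Omega_\delta$,

$$\mathrm{Vol}(B(x, r) \cap \Omega_\delta) \gtrsim v_d r^d.$$

The Covering Condition: for any $r > 0$, there exists a collection of points $N_r \subset \Omega_\delta$ such that $\mathrm{card}(N_r) \lesssim r^{-d}$ and

$$\bigcup_{y \in N_r} B(y, r) \supset \Omega_\delta.$$

Low Noise Condition: Letting $\{C_k\}_{k=1}^m$ be the connected components of $\{x : p(x) > \lambda^*\}$,

$$\min_{k \neq k'} \mathrm{dist}(C_k \cap \{p \geq \lambda^* + \delta\}, C_{k'} \cap \{p \geq \lambda^* + \delta\}) \gtrsim \delta^{1/\alpha}.$$

There exist densities satisfying the above conditions, for any $\alpha > 1$. For example, natural splines ($\alpha = 3$) and Morse functions ($\alpha = 2$).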

δ-Consistency: α > 1

Consider the DBSCAN-CT(α) estimator: same as the DBSCAN-CTestimator, except that a KDE with an α-valid kernel (α > 1) is used torank the data. It can similarly be computed with a union-find algorithm.

Let $p \in \Sigma(L, \alpha)$, with α > 1, be any density function with compact connected support and finitely many split levels bounded from below by $\lambda_0 > 0$ and satisfying the above conditions. Then the DBSCAN-CT(α) estimator with parameter h of order

$\left(\frac{\log n}{n}\right)^{1/(2\alpha+d)}$

is $\delta_n$-consistent with rate $\delta_n$ of order

$\left(\frac{\log n}{n}\right)^{\alpha/(2\alpha+d)},$

and γ of order $\frac{1}{n} + h^{-d} \exp(-\lambda_0 n h^d)$. The constants depend on d, L, $\|p\|_\infty$ and α.

The above rate is minimax optimal, up to log n factors.
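To get a feel for how the rate depends on the smoothness α and the dimension d, the orders above can be evaluated numerically. The sketch below ignores all constants, so the numbers only indicate scaling; the function names `bandwidth_order` and `rate_order` are hypothetical.

```python
import math

def bandwidth_order(n, alpha, d):
    """Order of the bandwidth h: (log n / n)^(1 / (2*alpha + d)), constants ignored."""
    return (math.log(n) / n) ** (1.0 / (2.0 * alpha + d))

def rate_order(n, alpha, d):
    """Order of delta_n: (log n / n)^(alpha / (2*alpha + d)), constants ignored."""
    return (math.log(n) / n) ** (alpha / (2.0 * alpha + d))

for n in (10**3, 10**4, 10**5):
    h = bandwidth_order(n, alpha=2.0, d=2)
    delta = rate_order(n, alpha=2.0, d=2)
    print(f"n={n:>6}  h~{h:.3f}  delta_n~{delta:.4f}")
```

Note that `rate_order` equals `bandwidth_order` raised to the power α, so the rate improves with larger α and degrades with larger d.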


Summary

We have shown that DBSCAN-based procedures deliver, under some conditions, optimal consistency rates for cluster tree estimation, with faster rates arising from smoother densities, while remaining computationally feasible.

Interestingly, the rates we obtain match the rates for minimax estimation of densities in Σ(L, α) in the ‖ · ‖∞ norm. Though in hindsight this may not be that surprising, this result further confirms that the ‖ · ‖∞ norm is a good (possibly the right) metric for cluster tree consistency. See also Kim et al. (2016).

It is surprising (to us), however, that one can in principle perform optimal clustering using computationally feasible methods for the connectivity step.


Future work

For instance...

Choice of the tuning parameter h.

Adaptivity to α.

Extension to algorithms based on knn-graphs and knn-density estimators.

Extension to non-Euclidean data.

Inference beyond consistency: confidence sets for the cluster tree.

Study single-linkage type procedures to estimate higher-order topological properties of the densities. In an upcoming work we will show how to draw statistical inference for the persistent homology of high-density sets using the Rips complex, further advancing a line of work initiated by Bobrowski, Mukherjee and Taylor (2014).

Investigate if and how density-based clustering can work in high dimensions.


Inference for Cluster Trees: Beyond Consistency

How do we carry out statistical inference for a discrete object such as a cluster tree?

In Kim et al. (2016) we illustrate the difficulties in building confidence sets for density trees. We also propose bootstrap-based pruning methods to “simplify” the tree and eliminate spurious clusters.

[Figure: scatter plots of the Ring data (n = 1200), the Mickey mouse data (n = 1200) and the Yingyang data (n = 3200), each paired with a tree panel at alpha = 0.05 whose axis is labelled lambda.]

It is worth noting that this is a strongly non-parametric problem for which only one-sided inference (Donoho, 1988) is feasible: it is not possible to learn the entire tree, but we may be able to discern parts of it.


Density Clustering in High-Dimensions

Our results are established within the classic nonparametric framework, leading to dimension-dependent rates! This would suggest that density clustering cannot/should not be done in high dimensions.

In fact, empirically that is not the case (see next slide).

Dimension-dependent rates are a bias issue, which calls for a vanishing h. But if we keep h fixed and we target $p_h$ instead of p, then the dimension no longer affects the rate. (It is still in the constants!)

Intuition: in high dimensions, density clustering may still work provided that good clustering solutions come from heavily biased density estimators. This paradigm requires a different analysis.
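The bias remark can be made concrete with the usual decomposition of the sup-norm error of a kernel density estimator. The display below is a standard sketch rather than a statement taken from the talk: $\hat p_h$ denotes the KDE with bandwidth $h$, $p_h = \mathbb{E}[\hat p_h] = p * K_h$ is its mean, and the orders are meant up to constants, with high probability, under smoothness and kernel assumptions of the kind used above.

```latex
\[
\|\hat p_h - p\|_\infty
  \;\le\; \underbrace{\|\hat p_h - p_h\|_\infty}_{\text{stochastic term}}
  \;+\; \underbrace{\|p_h - p\|_\infty}_{\text{bias}},
\qquad
\|\hat p_h - p_h\|_\infty \lesssim \sqrt{\frac{\log n}{n h^d}},
\qquad
\|p_h - p\|_\infty \lesssim h^{\alpha}.
\]
```

Balancing the two terms gives $h$ of order $(\log n / n)^{1/(2\alpha+d)}$ and an error of order $(\log n / n)^{\alpha/(2\alpha+d)}$, which depends on $d$. If instead $h$ is held fixed and the target is $p_h$, only the stochastic term remains; it is of order $\sqrt{\log n / n}$, and the dimension enters only through the factor $h^{-d/2}$ in the constant.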


High-Dimensional Density-Based Clustering: Population Admixture

Data from the Human Genome Diversity Project (HGDP) dataset, available at http://www.hagsc.org/hgdp/files.html.

Cleaned-up dataset comprising 11,775 SNPs from 931 subjects from 53 populations, from Crosset et al. (2010).

The goal of the analysis is to identify the hierarchy of high-density clusters of individuals in the sample, ideally capturing the correct membership in populations.

We use DeBaCl, a density clustering algorithm that uses knn-graphs. Below, the first level set tree uses k = 40 and the second uses k = 6.


Figure 3.6: Oversmoothed level set tree result for population genetics. (a) When the smoothing parameter k is set to 40, the estimated level set tree for the HGDP population genetic data has only four all-mode clusters. (b) The map shows these clusters describe continent groups well, but do not capture more detailed population affiliations. Each pie chart on the map represents a true population, and the slices of each pie represent the contribution from each cluster (matched by color to the dendrogram). The clusters in this result are very well matched to populations.

Figure 3.7: Exploring high-density population genetics clusters. The level set tree is constructed with k = 6, yielding a more detailed and multi-scale set of high-density clusters. (a) Cutting the tree at a low α level yields continent groups and highly dissimilar populations (Figure 3.8a). (b) Cutting at a higher α level produces high-density clusters that better capture individual populations (Figure 3.8b).


Figure 3.8: Maps indicating clusters for the dendrograms in Figure 3.7. Each pie chart on the map represents a true population, and the slices of each pie represent the contribution from each cluster (matched by color to the respective dendrogram in Figure 3.7). (a) High-density clusters for a low α cut of the tree (Figure 3.7a). Populations in Papua New Guinea and the Americas are identified, but only continent groups are recovered in Africa, Eurasia, and East Asia. (b) High-density clusters for the high α cut (Figure 3.7b) correspond better to individual populations.


Density-Based Clustering of Functional Data

Density-based clustering algorithms can be used in general metric spaces and in particular with functional data.

The phoneme dataset, from Ferraty and Vieu (2006), contains log-periodograms of 2000 instances of digitized human speech, divided evenly between five phonemes: “sh”, “dcl” (as in “dark”), “iy” (as in the vowel of “she”), “aa”, and “ao”. Each recording is treated as a single functional observation, which was smoothed using a cubic spline.

The distance between functions is the L2 distance (each phoneme is observed over 150 frequencies).

We use DeBaCl with k = 20.
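As a rough illustration of how curves observed on a common grid can feed a knn-based procedure, the sketch below computes pairwise L2 distances between discretized functions and a simple knn pseudo-density. It does not use DeBaCl's API: the function names are hypothetical, the pseudo-density is taken to be the inverse distance to the k-th nearest neighbour (an assumption, since the talk does not give the formula), and the curves are assumed distinct so that this distance is positive.

```python
import numpy as np

def l2_distance_matrix(curves, grid_spacing=1.0):
    """Pairwise L2 distances between curves sampled on a common, evenly
    spaced grid. curves has shape (n_curves, n_grid_points)."""
    sq_norms = (curves ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * curves @ curves.T
    np.maximum(sq, 0.0, out=sq)   # guard against tiny negative round-off
    np.fill_diagonal(sq, 0.0)     # self-distances are exactly zero
    return np.sqrt(sq * grid_spacing)

def knn_pseudo_density(dist, k):
    """Inverse of the distance to the k-th nearest neighbour (self excluded).
    Assumption: this simple form stands in for the pseudo-density estimator."""
    kth = np.sort(dist, axis=1)[:, k]  # column 0 is the zero self-distance
    return 1.0 / kth

# Tiny example: three constant "curves" observed at four grid points.
curves = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 1.0],
                   [3.0, 3.0, 3.0, 3.0]])
dist = l2_distance_matrix(curves)
pseudo = knn_pseudo_density(dist, k=1)
print(dist)
print(pseudo)
```

For the phoneme data the input would be a 2000 × 150 array with k = 20; the Gram-matrix form avoids building a 2000 × 2000 × 150 array of differences.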


Figure 4.1: Spoken phoneme data, separated into the 5 true classes. Each phoneme is represented by the log-periodogram at 150 fixed frequencies (see Hastie et al. (1995) for details), which we smoothed with a cubic spline. The phonemes are (a) ‘sh’, (b) ‘iy’, (c) ‘dcl’, (d) ‘aa’, and (e) ‘ao’. (f) The mean function for each phoneme; color indicates correspondence between phoneme class and mean function.

what contrived to use to a pseudo-density estimate instead of a bona fide density. Furthermore, the phoneme dataset has been curated carefully by Hastie et al. (1995) and Ferraty and Vieu (2006) for the purpose of illustrating statistical methods. Hurricane trajectories are also easily visualized functional data, but they are


Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indicate higher pseudo-density values, implying greater proximity to neighbors.

Figure 4.3: (a) Phoneme level set tree. (b) All-mode clusters. The modal observation in each all-mode cluster is shown with a black outline.

sampled irregularly (in space), have variable lengths, and lie in an ambient dimension larger than d = 1, making them much more like white matter fiber streamlines. We show that level set trees can be used effectively to describe the topography of a hurricane track dataset and to gather tracks into coherent clusters, opening a new
