Topology for Data Science 1: An Introduction to Topological Data Analysis

Peter Bubenik

University of Florida, Department of Mathematics, [email protected]

http://people.clas.ufl.edu/peterbubenik/

January 23, 2017

Tercera Escuela de Análisis Topológico de Datos y Topología Estocástica

ABACUS, Estado de México

Peter Bubenik Introduction to Topological Data Analysis


Topology Statistics More details Homology Persistent homology

Topological Data Analysis

What is topology and why use it to analyze data?

Topology is a branch of mathematics which is good at extracting global qualitative features from complicated geometric structures.

Example of a topological question

Is a given graph connected?
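For a finite graph this question can be answered mechanically. The following is a minimal sketch using breadth-first search; the function name `is_connected` and the vertex/edge-list input format are assumptions made for illustration, not part of the slides.

```python
from collections import deque

def is_connected(vertices, edges):
    """Return True if the undirected graph has at most one connected component."""
    vertices = list(vertices)
    if not vertices:
        return True
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Breadth-first search from an arbitrary start vertex.
    seen = {vertices[0]}
    queue = deque([vertices[0]])
    while queue:
        u = queue.popleft()
        for w in neighbors[u] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(vertices)

path = is_connected("ABCD", [("A", "B"), ("B", "C"), ("C", "D")])
split = is_connected("ABCD", [("A", "B"), ("C", "D")])
```

The answer depends only on which vertices are joined, not on where they are drawn, which is what makes it a topological question.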

Topological Data Analysis

uses topology to summarize and learn from the “shape” of data.


Simplicial complexes


Exercise 1: Simplicial complexes for computers

[Figure: a simplicial complex on the vertices A, B, C, D.]

What is the corresponding abstract simplicial complex?

{{A}, {B}, {C}, {D}, {A,B}, {A,D}, {B,C}, {B,D}, {C,D}, {B,C,D}}
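One possible way to hand this complex to a computer is as a set of vertex sets. The sketch below stores each simplex as a `frozenset` and checks the defining property of an abstract simplicial complex, namely that every nonempty subset of a simplex is again a simplex. The names `K`, `is_simplicial_complex` and `simplices_by_dimension` are illustrative assumptions.

```python
from itertools import combinations

# The complex from the exercise, one frozenset of vertex labels per simplex.
K = {frozenset(s) for s in ["A", "B", "C", "D", "AB", "AD", "BC", "BD", "CD", "BCD"]}

def is_simplicial_complex(simplices):
    """Check that every nonempty proper subset (face) of each simplex is present."""
    for s in simplices:
        for k in range(1, len(s)):
            for face in combinations(s, k):
                if frozenset(face) not in simplices:
                    return False
    return True

def simplices_by_dimension(simplices):
    """Group simplices by dimension (a simplex with k + 1 vertices has dimension k)."""
    out = {}
    for s in simplices:
        out.setdefault(len(s) - 1, []).append(s)
    return out

counts = {dim: len(group) for dim, group in simplices_by_dimension(K).items()}
```

Here `counts` records 4 vertices, 5 edges and 1 triangle. Removing the edge {C,D} while keeping the triangle {B,C,D} would make the check fail.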


Exercise 2: Betti numbers of simplicial complexes

$\beta_0$ = # of connected components

$\beta_1$ = # of holes

$\beta_2$ = # of voids

$\beta_0 = 3$

$\beta_1 = 1$

$\beta_2 = 1$


Homology of simplicial complexes

Definition

Homology in degree k is given by k-cycles modulo the k-boundaries.
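In symbols, and assuming the usual notation $\partial_k$ for the boundary map on $k$-chains (the same symbol used in Exercise 3), the definition and the resulting Betti number can be written as follows. This is the standard formulation, stated here as a reading aid rather than taken from the slide.

```latex
\[
  H_k \;=\; \ker \partial_k \,/\, \operatorname{im} \partial_{k+1},
  \qquad
  \beta_k \;=\; \dim H_k \;=\; \operatorname{nullity}(\partial_k) - \operatorname{rank}(\partial_{k+1}).
\]
```

The $k$-cycles are the elements of $\ker \partial_k$ and the $k$-boundaries are the elements of $\operatorname{im} \partial_{k+1}$.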


Exercise 3: Homology via linear algebra

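[Figure: the simplicial complex on the vertices A, B, C, D.]

Dimensions of vector spaces of $k$-chains:

$\dim(C_0) = 4$

$\dim(C_1) = 5$

$\dim(C_2) = 1$

Boundary matrices:

\[
\partial_0 = 0, \qquad
\partial_1 = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 1
\end{pmatrix}, \qquad
\partial_2 = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix}, \qquad
\partial_3 = 0
\]

$\beta_0 = \operatorname{nullity}(\partial_0) - \operatorname{rank}(\partial_1) = 4 - 3 = 1$

$\beta_1 = \operatorname{nullity}(\partial_1) - \operatorname{rank}(\partial_2) = 2 - 1 = 1$

$\beta_2 = \operatorname{nullity}(\partial_2) - \operatorname{rank}(\partial_3) = 0 - 0 = 0$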
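The same computation can be sketched in code. The matrices on the slide carry no signs, so the sketch assumes coefficients in the field with two elements and computes ranks by Gaussian elimination mod 2; this is an assumption about the intended coefficients. The rows of the first matrix are read as A, B, C, D and its columns as AB, AD, BC, BD, CD, which matches the complex of Exercise 1. The helper name `rank_mod2` is illustrative.

```python
def rank_mod2(rows):
    """Rank of a 0/1 matrix over the field with two elements (Gaussian elimination)."""
    rows = [list(r) for r in rows]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a 1 in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]  # add rows mod 2
        rank += 1
    return rank

d1 = [[1, 1, 0, 0, 0],
      [1, 0, 1, 1, 0],
      [0, 0, 1, 0, 1],
      [0, 1, 0, 1, 1]]
d2 = [[0], [0], [1], [1], [1]]

dims = [4, 5, 1]                              # dim C_0, dim C_1, dim C_2
ranks = [0, rank_mod2(d1), rank_mod2(d2), 0]  # ranks of d_0, d_1, d_2, d_3
# beta_k = nullity(d_k) - rank(d_{k+1}), with nullity(d_k) = dim C_k - rank(d_k)
betti = [(dims[k] - ranks[k]) - ranks[k + 1] for k in range(3)]
```

Working mod 2 matters here: over the real numbers the unsigned matrix for $\partial_1$ would have rank 4 instead of 3, because the triangle B, C, D is an odd cycle.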


Simplicial complexes from point data

The Čech construction


Exercise 4: Constructing a Čech complex

Draw a picture of $\check{C}_{1/2}(\{(0, 0), (0, 1), (1, 0), (1, 1)\})$.

[Figure: the four points (0, 0), (1, 0), (0, 1), (1, 1).]
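A small sketch of the construction for points in the plane, up to triangles. It assumes closed balls of radius r and the convention that a simplex belongs to the complex when the balls around its vertices share a common point, which is equivalent to the smallest enclosing ball of the vertices having radius at most r. Other sources use open balls or a diameter parameter, so the boundary case may differ. The names `miniball_radius` and `cech_complex` are illustrative.

```python
from itertools import combinations
from math import dist, sqrt

def miniball_radius(pts):
    """Radius of the smallest enclosing ball of 1, 2 or 3 points in the plane."""
    if len(pts) == 1:
        return 0.0
    if len(pts) == 2:
        return dist(*pts) / 2
    a, b, c = sorted(dist(p, q) for p, q in combinations(pts, 2))
    if a * a + b * b <= c * c:       # right, obtuse or degenerate: longest side is a diameter
        return c / 2
    s = (a + b + c) / 2              # acute triangle: circumradius via Heron's formula
    area = sqrt(s * (s - a) * (s - b) * (s - c))
    return a * b * c / (4 * area)

def cech_complex(points, r, max_dim=2):
    """All simplices of dimension <= max_dim (at most 2 here) whose r-balls intersect."""
    return [s for k in range(1, max_dim + 2)
            for s in combinations(points, k)
            if miniball_radius(s) <= r]

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
K = cech_complex(square, 0.5)
edges = [s for s in K if len(s) == 2]
triangles = [s for s in K if len(s) == 3]
```

Under these assumptions the result is the four vertices together with the four sides of the square: the diagonals have length √2 > 1, so they and all triangles are excluded.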


The parameter

Question

What is the right value for the parameter in the Čech construction?

Often, there is no one “right” choice.


Persistence

Main idea: persistence

Vary the parameter and keep track of when features appear and disappear.

Varying the radii of the spheres in the Čech construction, we get an increasing family of simplicial complexes.


Filtered simplicial complex from points in $\mathbb{R}^2$

[Figures: the filtered simplicial complex at radius = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.]


Mathematical encoding

We have an increasing sequence of simplicial complexes

\[ X_0 \subseteq X_1 \subseteq X_2 \subseteq \cdots \subseteq X_m \]

called a filtered simplicial complex.

Apply homology.

We get a sequence of vector spaces and linear maps

\[ V_0 \to V_1 \to V_2 \to \cdots \to V_m \]

called a persistence module.
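In degree 0 this pipeline can be followed by hand or with a few lines of code. The sketch below is restricted to $H_0$ of the Čech filtration of a point cloud and assumes the radius parameterization, so an edge appears when two balls first touch, at half the distance between their centers. It uses a union-find structure: every point is born at radius 0, and each time an edge joins two different components, one component dies. The function name `h0_barcode` is illustrative, and higher degrees need the full boundary-matrix reduction, which is not shown.

```python
from itertools import combinations
from math import dist, inf

def h0_barcode(points):
    """Birth-death pairs in degree 0 for the Cech filtration of a point cloud."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # An edge enters the filtration at radius = distance / 2.
    edges = sorted((dist(points[i], points[j]) / 2, i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for radius, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components merge, so one of them dies
            parent[ri] = rj
            bars.append((0.0, radius))
    if points:
        bars.append((0.0, inf))      # the last component never dies
    return bars

bars = h0_barcode([(0, 0), (1, 0), (3, 0)])
```

For the three collinear points used here, the components merge at radii 0.5 and 1.0, and one bar is infinite.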


Graph of a persistence module

\[ V_0 \to V_1 \to V_2 \to V_3 \to V_4 \to V_5 \to V_6 \to V_7 \to \cdots \to V_m \]

Fundamental Theorem of Persistent Homology

There exists a choice of bases for the vector spaces $V_i$ such that each map is determined by a bipartite matching of basis vectors.

[Figure: graph of the persistence module, horizontal axis from 2 to 12.]


Barcode from our points in $\mathbb{R}^2$

Straightening out the previous graph, we get a barcode.

[Figure: barcode, horizontal axis from 2 to 12.]


Persistence diagram from our points in $\mathbb{R}^2$

[Figure: persistence diagram, with birth on the horizontal axis and death on the vertical axis, both from 0 to 12.]


Exercise 5: Barcodes and persistence diagrams

[Figure: a filtered simplicial complex whose simplices are labeled 0 through 9.]

Time          0          1          2          3          4          5          6          7          8          9
Betti number  $\beta_0$  $\beta_0$  $\beta_0$  $\beta_0$  $\beta_0$  $\beta_0$  $\beta_0$  $\beta_1$  $\beta_1$  $\beta_1$
Effect        +          +          +          −          +          −          −          +          +          −

Birth–Death pairs for $H_0$: (0, ∞), (1, 3), (2, 6), (4, 5)

Birth–Death pairs for $H_1$: (7, ∞), (8, 9)
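As a consistency check, the birth-death pairs can be turned back into Betti numbers at each time. The sketch below assumes each bar is alive on the half-open interval from its birth up to, but not including, its death; the function name `betti_curve` is illustrative.

```python
from math import inf

h0 = [(0, inf), (1, 3), (2, 6), (4, 5)]
h1 = [(7, inf), (8, 9)]

def betti_curve(pairs, times):
    """Number of bars alive at each time t, using half-open bars [birth, death)."""
    return [sum(b <= t < d for b, d in pairs) for t in times]

times = range(10)
b0 = betti_curve(h0, times)
b1 = betti_curve(h1, times)
```

The successive differences of `b0` over times 0 to 6 are +1, +1, +1, −1, +1, −1, −1, and those of `b1` over times 7 to 9 are +1, +1, −1, which agrees with the effect row of the table.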


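Barcode

[Figure: barcode with bars for $H_0$ and $H_1$, horizontal axis from 0 to 12.]

Persistence diagram

[Figure: persistence diagram, with birth on the horizontal axis and death on the vertical axis, both from 0 to 10.]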


Topology Statistics More details Persistence Landscape Hypothesis testing

Statistical viewpoint

The barcode/persistence diagram is a random variable; it is a summary statistic.

[Diagram: Underlying Probability Space → Space of Topological Summaries]


Challenges

[Diagram: Topological Summary → Statistics and Machine Learning]

For example:

calculate averages

understand variances

test hypotheses

cluster and classify


Statistics with barcodes/persistence diagrams

[Diagram: Set of barcodes → Statistics, with the label "Metric"]

Easy:

clustering

certain hypothesis tests

Hard:

calculating averages

understanding variances

classification


Making life easier

[Diagram: Barcode space → Vector space]

One way to turn a barcode or persistence diagram into a vector is the persistence landscape.

Advantages:

it does not lose information

it is stable

it has a discrete and a continuous version


Persistence landscape from a barcode

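Replace

[Figure: a bar in a barcode, horizontal axis from −1 to 7]

with

[Figure: the corresponding function, horizontal axis from −1 to 7, vertical axis from −1 to 3]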


Persistence landscape from a barcode

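Barcode:

[Figure: barcode, horizontal axis from 0 to 14]

Persistence Landscape:

[Figure: graphs of $\lambda_1$, $\lambda_2$, $\lambda_3$; horizontal axis from 0 to 14, vertical axis from 0 to 6]

$\lambda_k = 0$ for $k \ge 4$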


Persistence landscape from a persistence diagram

[Figures: a persistence diagram with axes birth and death, both from 0 to 12, and the resulting persistence landscape functions $\lambda_1$, $\lambda_2$, $\lambda_3$.]


Exercise 6: Graphing the persistence landscape

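[Figure: the filtered simplicial complex with simplices labeled 0 through 9.]

Birth–Death pairs for $H_0$: (1, 3), (2, 6), (4, 5)

Graph the corresponding persistence landscape.

[Figure: graphs of $\lambda_1$ and $\lambda_2$; horizontal axis from 0 to 6, vertical axis from 0 to 2.]

$\lambda_k = 0$ for $k \ge 3$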
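The graph can be checked numerically. The sketch below uses the standard definition of the persistence landscape, which the slides illustrate with pictures but do not state as a formula: each pair (b, d) gives the tent function max(0, min(t − b, d − t)), and $\lambda_k(t)$ is the k-th largest tent value at t. The function name `landscape` is illustrative, and only the finite pairs listed in the exercise are used.

```python
def landscape(pairs, k, t):
    """k-th persistence landscape function at t (k = 1 picks the largest tent value)."""
    tents = sorted((max(0.0, min(t - b, d - t)) for b, d in pairs), reverse=True)
    return tents[k - 1] if k <= len(tents) else 0.0

pairs = [(1, 3), (2, 6), (4, 5)]

peak_long_bar = landscape(pairs, 1, 4)      # top of the tent over (2, 6)
peak_short_bar = landscape(pairs, 1, 2)     # top of the tent over (1, 3)
second_layer = landscape(pairs, 2, 4.5)     # top of the tent over (4, 5), under (2, 6)
grid = [i / 10 for i in range(71)]
third_layer_max = max(landscape(pairs, 3, t) for t in grid)
```

No three of the intervals overlap, so the third layer vanishes, in agreement with $\lambda_k = 0$ for $k \ge 3$.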


Making life easier

[Diagram: Barcode space → Vector space, via the Persistence Landscape]

Choices for the vector space

continuous version: $L^2(\mathbb{R}^2)$

discrete version: $\mathbb{R}^n$

What is great about $\mathbb{R}^n$ and $L^2(\mathbb{R}^2)$?

are vector spaces (easy to measure distances, averages)

have inner products (easy to measure angles)

are complete (good for studying convergence)

Thus we can

apply tools from probability, statistics and machine learning
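A sketch of the discrete version: sample the first few landscape functions on a grid, flatten them into one vector in $\mathbb{R}^n$, and then average or compare vectors coordinate by coordinate. The grid, the depth of 3 and the second barcode are hypothetical choices for illustration, and the helper names are assumptions; the tent-function formula is the standard landscape definition.

```python
from math import sqrt

def landscape_vector(pairs, grid, depth):
    """Sample lambda_1..lambda_depth on the grid; one flat list of length depth * len(grid)."""
    vec = []
    for k in range(depth):
        for t in grid:
            tents = sorted((max(0.0, min(t - b, d - t)) for b, d in pairs), reverse=True)
            vec.append(tents[k] if k < len(tents) else 0.0)
    return vec

def mean_vector(vectors):
    """Coordinate-wise average of equally long vectors."""
    return [sum(xs) / len(xs) for xs in zip(*vectors)]

def l2_distance(u, v, step):
    """Riemann-sum approximation of the L2 distance between two sampled landscapes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) * step)

grid = [i / 10 for i in range(71)]                       # 0.0, 0.1, ..., 7.0
lam_a = landscape_vector([(1, 3), (2, 6), (4, 5)], grid, depth=3)
lam_b = landscape_vector([(1, 4), (3, 6)], grid, depth=3)   # hypothetical second barcode

average = mean_vector([lam_a, lam_b])
distance = l2_distance(lam_a, lam_b, step=0.1)
```

The average of two landscapes is again a vector of the same length, which is what makes means, variances and standard learning tools available.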


Topological hypothesis testing


Points → kernel density estimator → filtered simplicial complex


Null hypothesis: $\|\lambda_S\|_1 = \|\lambda_T\|_1$.

Two-sample z-test:

degree   decision        p value
0        cannot reject
1        reject          $3 \times 10^{-6}$
2        cannot reject
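A two-sample z-test of this kind can be sketched as follows. This is a minimal illustration, not the code used for the talk: it assumes each group has already been reduced to a list of landscape 1-norms, and the function name `two_sample_z_test` and the numbers in `norms_S` and `norms_T` are made up for the example.

```python
import math
from statistics import mean, variance

def two_sample_z_test(xs, ys):
    """Two-sided z-test for equal means of two independent samples."""
    nx, ny = len(xs), len(ys)
    # Standard error of the difference of sample means (unpooled variances).
    se = math.sqrt(variance(xs) / nx + variance(ys) / ny)
    z = (mean(xs) - mean(ys)) / se
    # Two-sided normal tail probability: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical 1-norms of landscapes from two groups S and T (illustrative values only).
norms_S = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8]
norms_T = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]

z, p = two_sample_z_test(norms_S, norms_T)
print(f"z = {z:.3f}, p = {p:.2e}")
```

The normal approximation is only reasonable when each group contains enough landscapes for the sample means to be close to Gaussian.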

Topological hypothesis testing, noisy

Null hypothesis: $\|\lambda_S - \lambda_T\|_2 = 0$.

Permutation test:

dim   decision   p value
0     reject     0.0111
1     reject     0.0000
2     reject     0.0000
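A permutation test on the distance between mean landscapes can be sketched as below. The sketch assumes each landscape has been discretized to a vector of equal length (the discrete version in $\mathbb{R}^n$), and it uses the Euclidean distance between group means as the test statistic. The names `permutation_test`, `mean_vector` and `l2_distance` are hypothetical, and the add-one correction in the p-value is a choice made here, not something stated on the slide.

```python
import math
import random

def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def l2_distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def permutation_test(group_s, group_t, n_perm=2000, seed=0):
    """Permutation p-value for the distance between the two group means."""
    rng = random.Random(seed)
    observed = l2_distance(mean_vector(group_s), mean_vector(group_t))
    pooled = list(group_s) + list(group_t)
    k = len(group_s)
    count = 0
    for _ in range(n_perm):
        # Randomly reassign the group labels and recompute the statistic.
        rng.shuffle(pooled)
        stat = l2_distance(mean_vector(pooled[:k]), mean_vector(pooled[k:]))
        if stat >= observed:
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (count + 1) / (n_perm + 1)

# Illustrative, made-up discretized landscapes for two groups.
group_s = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
group_t = [[5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]]
observed, p = permutation_test(group_s, group_t)
print(f"observed distance = {observed:.3f}, p = {p:.4f}")
```

Because the test only relabels the observed landscapes, it makes no distributional assumption beyond exchangeability under the null hypothesis.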

Software

Persistent Homology:

CHOMP, Dionysus, DIPHA, Eirene, GUDHI, JavaPlex, Perseus, PHAT, Ripser, SimBa, SimPers

Persistence Landscape:

The Persistence Landscape Toolbox

Topological Data Analysis:

the R package TDA

my R code

Stability

Given $f : X \to \mathbb{R}$, let $\lambda(f)$ be the persistence landscape of the sublevel sets of $f$.

Landscape Stability Theorem (B)

Let $f, g : X \to \mathbb{R}$. Then

$\|\lambda(f) - \lambda(g)\|_\infty \le \|f - g\|_\infty$.

If $X$ is nice and $f$ and $g$ are tame and Lipschitz then

$\|\lambda(f) - \lambda(g)\|_2^2 \le C \|f - g\|_\infty^{2-k}$.

Average landscapes

Persistence landscapes $\lambda^{(1)}, \ldots, \lambda^{(n)}$ have a pointwise average,

$\bar{\lambda}(k, t) = \frac{1}{n} \sum_{i=1}^{n} \lambda^{(i)}(k, t)$
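The pointwise average can be sketched in a few lines. The sketch uses the standard definition of the landscape of a persistence diagram: $\lambda(k, t)$ is the $k$-th largest value of $\max(0, \min(t - b, d - t))$ over the (birth, death) pairs $(b, d)$. The function names and the two small diagrams are hypothetical and serve only as an illustration.

```python
def landscape(diagram, k, t):
    """Value at t of the k-th landscape function (k = 1, 2, ...) of a diagram.

    `diagram` is a list of (birth, death) pairs.
    """
    # Each pair contributes a tent function that peaks at the midpoint of its interval.
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in diagram),
        reverse=True,
    )
    # Beyond the number of pairs, the landscape is identically zero.
    return tents[k - 1] if k <= len(tents) else 0.0

def average_landscape(diagrams, k, t):
    """Pointwise average of the k-th landscape functions of several diagrams."""
    return sum(landscape(dgm, k, t) for dgm in diagrams) / len(diagrams)

# Two made-up diagrams, each with a single interval.
diagrams = [[(0, 4)], [(2, 6)]]
for t in (2, 3, 4):
    print(t, average_landscape(diagrams, 1, t))
```

The average is again a function of $(k, t)$, but it need not be the landscape of any single persistence diagram.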

Average diagram vs average landscape

[Figure: plots on axes running from 0 to 16 (horizontal) and 0 to 10 (vertical); the landscape panels are labelled $\lambda_1$ and $\lambda_2$.]

Average landscapes for Gaussian random fields

Asymptotics for persistence landscapes

$\lambda$ is a random variable in $L^2(\mathbb{R}^2)$; $\|\lambda\|$ is a real random variable.

If $E\|\lambda\| < \infty$ then there exists $E(\lambda) \in L^2(\mathbb{R}^2)$ such that $E(f(\lambda)) = f(E(\lambda))$ for all continuous linear functionals $f$.

Strong Law of Large Numbers (B, 2015)

$\bar{\lambda}^{(n)} \to E(\lambda)$ almost surely

Central Limit Theorem (B, 2015)

$\sqrt{n}\,[\bar{\lambda}^{(n)} - E(\lambda)]$ converges weakly to a Gaussian random variable
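One consequence is worth writing out, under the extra assumption (not stated on the slide) that $E\|\lambda\|^2 < \infty$. Applying a continuous linear functional $f$ to the mean landscape of $n$ independent copies of $\lambda$ reduces the statement to the ordinary central limit theorem for real random variables, which is the kind of statement a z-test relies on. The symbols $Y$ and $\bar{Y}_n$ are introduced here for illustration only.

```latex
Y = f(\lambda), \qquad
\bar{Y}_n = f\big(\bar{\lambda}^{(n)}\big) = \frac{1}{n} \sum_{i=1}^{n} f\big(\lambda^{(i)}\big), \qquad
E(Y) = f\big(E(\lambda)\big),
\qquad
\sqrt{n}\,\big[\bar{Y}_n - E(Y)\big] \xrightarrow{\;d\;} N\big(0, \operatorname{Var}(Y)\big).
```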

Understanding variance

Two approaches:

Bootstrap and confidence intervals for persistence landscapes [Chazal, Fasy, Lecci, Rinaldo, Singh, Wasserman]

Figure 4: Top Left: Sample space of epicenters of 8000 earthquakes. Bottom Left: one of the 30 persistence diagrams. Middle: uniform and adaptive 95% confidence bands for the mean landscape $\mu(t)$. Right: uniform and adaptive 95% confidence bands for the mean weighted silhouette $E[\phi^{(0.01)}(t)]$.

repeat this procedure n = 30 times and compute the mean landscape $\bar{\lambda}_n$. Using the algorithm given in Algorithm 1, we obtain the uniform 95% confidence band of Theorem 3 and the adaptive 95% confidence band of Theorem 4. See Figure 4 (middle). Both the confidence bands have coverage around 95% for the mean landscape $\mu(t)$ that is attached to the distribution induced by the sampling scheme. Similarly, using the same n = 30 persistence diagrams we construct the corresponding weighted silhouettes using p = 0.01 and construct uniform and adaptive 95% confidence bands for the mean weighted silhouette $E[\phi^{(0.01)}(t)]$. See Figure 4 (right). Notice that, for most $t \in [0, T]$, the adaptive confidence band is tighter than the fixed-width confidence band.

6.2 Toy Example: Rings

In this example, we embed the torus $S^1 \times S^1$ in $\mathbb{R}^3$ and we use the rejection sampling algorithm of Diaconis et al. (2012) (R = 5, r = 1.8) to sample 10,000 points uniformly from the torus. Then we link it with a circle of radius 5, from which we sample 1,800 points; see Figure 5 (top left). These N = 11,800 points constitute the sample space. We randomly sample m = 600 of these points, construct the Vietoris-Rips filtration, compute the persistence diagram (Betti 1) and the corresponding first and third landscapes and the silhouettes for p = 0.1 and p = 4. We repeat this procedure n = 30 times to construct 95% adaptive confidence bands for the mean landscapes $\mu_1(t)$, $\mu_3(t)$ and the mean silhouettes $E[\phi^{(4)}(t)]$, $E[\phi^{(0.1)}(t)]$. Figure 5 (bottom left) shows one of the 30 persistence diagrams. In the persistence diagram, notice that three persistence pairs are more persistent than the rest. These correspond to the two nontrivial cycles of the torus and the cycle corresponding to the circle. We notice that many of the points in the persistence diagram are hidden by the first landscape. However, as shown in the figure, the third landscape function and the silhouette with parameter p = 0.1 are able to detect the presence of these features.

Figure 5: Top Left: The sample space. Bottom Left: one of the 30 persistence diagrams. Middle: adaptive 95% confidence bands for the mean first landscape $\mu_1(t)$ and mean third landscape $\mu_3(t)$. Right: adaptive 95% confidence bands for the mean weighted silhouettes $E[\phi^{(4)}(t)]$ and $E[\phi^{(0.1)}(t)]$.

7 Discussion

We have shown how the bootstrap can be used to give confidence bands for Bubenik's persistence landscape and for the persistence silhouette defined in this paper. We are currently working on several extensions to our work including the following: allowing persistence diagrams with countably many points, allowing T to be unbounded, and extending our results to new functional summaries of persistence diagrams. In the case of subsampling (scenario 2 defined in the introduction), we have provided accurate inferences for the mean function $\mu$. We are investigating methods to estimate the difference between $\mu$ (the mean landscape from subsampling) and $\lambda$ (the landscape from the original large dataset). Coupled with our confidence bands for $\mu$, this could provide an efficient approach to approximating the persistent homology in cases where exact computations are prohibitive.

Principal component analysis (coming in Talk 2)
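For the bootstrap approach, a uniform (fixed-width) confidence band around a mean landscape can be sketched as follows. This is a plain resampling bootstrap of the sup-norm deviation, written for illustration; it is not claimed to reproduce Algorithm 1 of the cited paper. It assumes each landscape has been sampled on a common grid of $t$ values, and the function names and example curves are hypothetical.

```python
import random

def mean_curve(curves):
    """Pointwise mean of curves sampled on a common grid."""
    n = len(curves)
    return [sum(col) / n for col in zip(*curves)]

def sup_distance(u, v):
    """Largest pointwise absolute difference between two sampled curves."""
    return max(abs(a - b) for a, b in zip(u, v))

def bootstrap_uniform_band(curves, level=0.95, n_boot=1000, seed=0):
    """Fixed-width bootstrap band (lower, center, upper) for the mean curve."""
    rng = random.Random(seed)
    n = len(curves)
    center = mean_curve(curves)
    sups = []
    for _ in range(n_boot):
        # Resample the curves with replacement and record the sup-norm deviation.
        resample = [curves[rng.randrange(n)] for _ in range(n)]
        sups.append(sup_distance(mean_curve(resample), center))
    sups.sort()
    # Empirical quantile of the bootstrap deviations gives the half-width.
    width = sups[min(n_boot - 1, int(level * n_boot))]
    lower = [c - width for c in center]
    upper = [c + width for c in center]
    return lower, center, upper

# Made-up landscape values on a five-point grid, for illustration only.
curves = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 1.2, 1.8, 0.9, 0.0],
    [0.0, 0.8, 2.1, 1.1, 0.0],
    [0.0, 1.1, 2.2, 1.0, 0.0],
]
lower, center, upper = bootstrap_uniform_band(curves)
print(lower)
print(center)
print(upper)
```

With only a handful of curves the band is a rough guide at best; the excerpt above uses n = 30 landscapes.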
