Persistence landscapes toolbox: a tool for topological statistics.

Paweł Dłotko

Geometrica Group, Inria Saclay

IST-Austria, 7 July 2015.

Possible characterization of data.

[Figure: space of data (high dimension, ∞) mapped to scalar characteristics of data (dimension 1 or 2). Labels: "Vorticity/dynamics in turbulent flow", "Dynamics of a flow", "Average enstrophy", "Discretization", "To discover", "Ayasdi (M. Nicolau et al.)".]

Why do people like low dimensions?

1. Because they can see them.

2. Because they can understand them.

3. Because they can use standard statistics for the obtained observations.

4. Because there are no tools to operate on higher dimensional data.

Persistence comes into play.

[Figure: boxes "Space of data" (high dimension, ∞), "Space of persistence diagrams" and "Scalar characteristics of data" (dimension 1 or 2); arrow label "Stable projection".]

Persistence, state of the art.

1. Persistent homology is a dimension reduction technique.

2. There are standard metrics to compare persistence diagrams.

3. Early attempts to define the Fréchet mean of diagrams.

Problem with the Fréchet mean.


Persistence, what people are doing.

[Figure: boxes "Space of data" (high dimension, ∞), "Space of persistence diagrams" and "Scalar characteristics of data" (dimension 1 or 2); arrow label "Stable projection".]

Persistence, what is worth considering.

[Figure: boxes "Space of data" (high dimension, ∞), "Space of persistence diagrams" and "Scalar characteristics of data" (dimension 1 or 2); arrow label "Stable projection".]

Lifting persistence to a larger space.

1. Persistence diagrams as distributions:

1.1 Fréchet means are unique, but hard to compute (K. Turner, Y. Mileyko, S. Mukherjee, J. Harer).

1.2 For image classification (J. Reininghaus, S. Huber, U. Bauer, R. Kwitt).

2. Persistence landscapes (P. Bubenik) / Size functions (M. Ferri, P. Frosini, C. Landi, B. di Fabio, A. Cerri, ...).

Persistence landscapes.


Persistence landscapes λ1.

Persistence landscapes λ2.

Persistence landscapes λ3.

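Formal definition.

1. The persistence landscape of a multiset of persistence barcodes $\{(b_i, d_i)\}_{i=1}^{n}$ is a set of functions $\lambda_k : \mathbb{R} \to \mathbb{R}$ such that $\lambda_k(x)$ is the k-th largest value of $\{f_{(b_i, d_i)}(x)\}_{i=1}^{n}$, where

2. $$f_{(b,d)}(x) = \begin{cases} 0 & \text{if } x \notin (b, d) \\ x - b & \text{if } x \in \left(b, \frac{b+d}{2}\right] \\ -x + d & \text{if } x \in \left(\frac{b+d}{2}, d\right) \end{cases} \tag{1}$$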
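Definition (1) can be evaluated directly at a single point. The following is a minimal Python sketch, not the toolbox's implementation: the names `tent` and `landscape` are illustrative, and λk is assumed to be identically 0 when the barcode has fewer than k bars.

```python
def tent(b, d, x):
    """Tent function f_(b,d)(x) from equation (1)."""
    if x <= b or x >= d:
        return 0.0
    # x - b on the left half of (b, d), d - x on the right half.
    return float(min(x - b, d - x))


def landscape(barcode, k, x):
    """Value of the k-th landscape function (k = 1, 2, ...) at x."""
    values = sorted((tent(b, d, x) for b, d in barcode), reverse=True)
    # Assumption: lambda_k is 0 everywhere if there are fewer than k bars.
    return values[k - 1] if k <= len(values) else 0.0


barcode = [(0, 4), (1, 3)]
print([landscape(barcode, k, 2) for k in (1, 2, 3)])  # [2.0, 1.0, 0.0]
```

Sorting at every query point is wasteful; it only illustrates the definition. The construction algorithms discussed later in the talk build each λk once as a piecewise linear function.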

Landscapes as size function.

[Figure: a landscape viewed as a size function; axis marks a, b.]

Persistence landscapes.

1. One-to-one representation of persistence.

2. Vector space operations on functions (+, −, multiplication by a scalar) are well defined.

3. The average of two functions f, g in function space is just (f + g)/2.

4. Standard Lp norms and distances are well defined. Landscapes are stable with respect to those norms.

5. Landscapes are PL (piecewise linear) functions, so they are easy to compute.

6. Each function is represented as a set of critical points. In between them, one needs to use linear interpolation.
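Points 3 and 6 can be illustrated together. The sketch below assumes a landscape function is stored as a list of critical points (x, y) with strictly increasing x, value 0 at both ends and 0 outside that range; the function names are hypothetical and are not taken from the toolbox. Since the sum of two piecewise linear functions can only bend where one of them bends, the average is exact once it is evaluated on the union of critical points.

```python
from bisect import bisect_right


def evaluate(points, x):
    """Evaluate a PL function given by sorted critical points (x, y)."""
    xs = [p[0] for p in points]
    if x < xs[0] or x > xs[-1]:
        return 0.0  # assumed to vanish outside its critical points
    i = bisect_right(xs, x)
    if i == len(xs):  # x is exactly the last critical point
        return float(points[-1][1])
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    # Linear interpolation between the two neighbouring critical points.
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def average(f, g):
    """Critical points of (f + g) / 2."""
    breakpoints = sorted({p[0] for p in f} | {p[0] for p in g})
    return [(x, (evaluate(f, x) + evaluate(g, x)) / 2) for x in breakpoints]


f = [(0, 0), (1, 1), (2, 0)]  # tent over the interval (0, 2)
g = [(1, 0), (2, 1), (3, 0)]  # tent over the interval (1, 3)
print(average(f, g))  # [(0, 0.0), (1, 0.5), (2, 0.5), (3, 0.0)]
```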

How to compute a persistence landscape?


Swapping line approach (M. Kerber)

Time complexity of construction.

1. In both cases, O(n log n + Kn), where K is the number of nonzero landscapes.

2. Pessimistically, this is O(n^2).

3. We believe that the swapping line algorithm may perform better in practice. Tests are underway.

4. Here we are talking about exact computations. The R package TDA does this over a grid (CMU TopStat Group).

Lp distance, sum of distances between levels.
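The slides do not spell out the formula. Assuming the usual convention for landscape norms (the one used in Bubenik's definition), "sum of distances between levels" can be written as follows for finite p:

```latex
\[
  \|\lambda - \lambda'\|_p^p
  \;=\; \sum_{k \ge 1} \|\lambda_k - \lambda'_k\|_p^p
  \;=\; \sum_{k \ge 1} \int_{\mathbb{R}} |\lambda_k(x) - \lambda'_k(x)|^p \, dx ,
  \qquad 1 \le p < \infty .
\]
```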

Averages.

Sum.

1/2 Sum.

Averages (in a grid).

1. Note that the landscape of an average does not, in general, come from a persistence diagram.

2. The slope of an average is between −1 and 1.

3. Typically it is much lower.

4. Error bounds over a grid may be heavily overestimated.

5. Exact representation gives more flexibility.


Permutation tests.

Worked example on numbers, with the difference of averages as the statistic:

1. First sample: 1, 5, 2, 6, 7, 9 (average 5). Second sample: 21, 25, 27, 23, 22, 29 (average 24.5).

2. Observed difference: |5 − 24.5| = 19.5.

3. Merge: 1, 5, 2, 6, 7, 9, 21, 25, 27, 23, 22, 29.

4. Shuffle: 21, 23, 7, 27, 6, 29, 5, 25, 1, 22, 2, 9.

5. Divide: 21, 23, 7, 27, 6, 29 (average 18.83) and 5, 25, 1, 22, 2, 9 (average 10.67).

6. Difference after shuffling: |18.83 − 10.67| = 8.17.
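The merge, shuffle and divide steps can be sketched in Python on the same numbers. This is an illustrative Monte Carlo version, not the toolbox's PermutationTest program; `permutation_test`, `n_perm` and `seed` are hypothetical names. For landscapes, the averages would be average landscapes and the absolute difference would be replaced by a landscape distance.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_perm=1000, seed=0):
    """Return (observed statistic, estimated p-value)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    merged = list(a) + list(b)                            # merge
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(merged)                               # shuffle
        left, right = merged[:len(a)], merged[len(a):]    # divide
        if abs(mean(left) - mean(right)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


a = [1, 5, 2, 6, 7, 9]
b = [21, 25, 27, 23, 22, 29]
observed, p_value = permutation_test(a, b)
print(observed)  # 19.5
```

A small p-value means that few random splits of the merged sample differ as much as the original two samples do.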


Simple classifiers.

Group 1: Persistence 1, Persistence 2, …, Persistence n. Average: Av(group 1).

Group 2: Persistence 1, Persistence 2, …, Persistence n. Average: Av(group 2).

...

Group k: Persistence 1, Persistence 2, …, Persistence n. Average: Av(group k).

New persistence diagram: compute its distance to each of the group averages.
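A minimal sketch of such a classifier, under the assumption that every landscape has already been sampled on a common grid and flattened into a vector of equal length, and that the new diagram is assigned to the group with the closest average. The Euclidean distance stands in for whichever landscape distance is chosen, and the names `average` and `classify` are illustrative.

```python
from math import dist  # Python 3.8+


def average(vectors):
    """Coordinate-wise average of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(sample, groups):
    """Return the label of the group whose average is closest to `sample`."""
    averages = {label: average(members) for label, members in groups.items()}
    return min(averages, key=lambda label: dist(sample, averages[label]))


# Toy data: two groups of two-dimensional "landscape vectors".
groups = {
    "group 1": [[0.0, 1.0], [0.0, 3.0]],
    "group 2": [[10.0, 10.0], [12.0, 10.0]],
}
label = classify([1.0, 2.0], groups)
print(label)  # group 1
```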

End-user programs to compute various statistics on persistence landscapes.

1. Computation of distance matrices.

2. Computation of average landscapes.

3. Standard deviation.

4. Inner products of landscapes.

5. Computation of integrals.

6. Computation of moments.

7. Permutation test.

8. t-test, ANOVA.

9. Classifiers.

10. Normalization of barcodes.

11. Plots.

Why bother with L1 distances when there is 1-Wasserstein?

[Figure: plot comparing W1 and L1; vertical axis ticks from 0.001 to 10000 in powers of ten, horizontal axis from 100 to 800.]

Why bother with L2 distances when there is 2-Wasserstein?

[Figure: plot comparing W2 and L2; vertical axis ticks from 0.001 to 10000 in powers of ten, horizontal axis from 100 to 800.]

Why bother with L∞ distances when there is the bottleneck distance?

[Figure: plot comparing Bottleneck and L∞; vertical axis ticks from 0.001 to 10 in powers of ten, horizontal axis from 100 to 800.]

Persistence, what is worth considering.

[Figure: boxes "Space of data" (high dimension, ∞), "Space of persistence diagrams" and "Scalar characteristics of data" (dimension 1 or 2); arrow label "Stable projection".]

Latest idea.

1. Suppose we have snapshots of a process S1, ..., Sn.

2. They give a collection D1, ..., Dn of diagrams.

3. Processes are characterized by real numbers G1, ..., Gn.

4. Can we recover G1, ..., Gn from D1, ..., Dn?

5. Is there a function f such that f(D1), ..., f(Dn) correlate with G1, ..., Gn?

6. In this case, a scientist usually tries to find f by looking at the diagrams.

7. This is frustrating and time consuming.

8. The new algorithm in PLT uses brute force to find the function of the diagram that correlates most with the data.

Latest idea.

[Figure: boxes "Process S1, ..., Sn" and "Persistence diagrams D1, ..., Dn", joined by an arrow labeled "persistence"; arrows labeled "f" and "characteristics"; ordering G1 < ... < Gn.]

Latest idea, more details.

1. Suppose that D1, ..., Dn are sorted according to G1, ..., Gn.

2. Φ is a collection of scalar valued functions defined on a diagram.

3. For every f ∈ Φ, compute f(D1), ..., f(Dn).

4. Compute the Kendall tau distance of the sequence [f(D1), ..., f(Dn)] from the sorted version of [f(D1), ..., f(Dn)].

5. The best function in Φ is the one with the smallest tau distance.

Kendall tau distance.

4 3 2 1
4 3 1 2
4 1 3 2
1 4 3 2
1 4 2 3
1 2 4 3
1 2 3 4

1. Six transpositions are needed.

2. Number of all possible transpositions: (4 · 3)/2 = 6.

3. The Kendall tau distance is defined as the number of transpositions needed divided by the number of transpositions needed to reverse a sequence.
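The definition above, together with the brute-force selection from the previous slide, can be sketched as follows. The number of adjacent transpositions needed to sort a sequence equals its number of inversions, which is what the code counts. The collection `PHI` is purely hypothetical (the slides do not list the functions PLT tries), and diagrams are assumed to be lists of (birth, death) pairs already sorted according to G1, ..., Gn.

```python
def kendall_tau_distance(values):
    """Normalized number of adjacent transpositions needed to sort `values`."""
    n = len(values)
    # Each out-of-order pair costs exactly one adjacent transposition.
    inversions = sum(
        1 for i in range(n) for j in range(i + 1, n) if values[i] > values[j]
    )
    return inversions / (n * (n - 1) / 2)


# Hypothetical collection Phi of scalar functions of a diagram.
PHI = {
    "total persistence": lambda diagram: sum(d - b for b, d in diagram),
    "max persistence": lambda diagram: max(d - b for b, d in diagram),
    "number of intervals": lambda diagram: len(diagram),
}


def best_function(diagrams):
    """Name of the function in PHI with the smallest tau distance."""
    return min(
        PHI,
        key=lambda name: kendall_tau_distance([PHI[name](d) for d in diagrams]),
    )


print(kendall_tau_distance([4, 3, 2, 1]))  # 1.0

# Toy diagrams, assumed to be sorted according to G1 < G2 < G3.
diagrams = [
    [(0, 1), (0, 0.9), (0, 0.8)],
    [(0, 2)],
    [(0, 3), (0, 0.1)],
]
print(best_function(diagrams))  # max persistence
```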

Latest idea, a variation.

[Figure: Collection 1 (D1, ..., Dn) and Collection 2 (E1, ..., Ek), both mapped to R by an unknown function f.]

Applications overview.

1. Patterns from numerical analysis (Cahn-Hilliard-Cook, diblock copolymer equations).

2. Dimensions of spheres.

3. Efficient distance matrix computations (granular media analysis).

Classification example: Cahn-Hilliard-Cook patterns.

Topological process.

1. A sequence of time-varying data, each element of which has its own persistence.

2. Distances are computed between the corresponding time steps.

3. Averaging over time steps (to get an averaged topological process).

[Figure: processes laid out over time; axis labels "time" and "Process #".]

Topological classifier.

[Figure: pipeline with labels "Dynamics f(proportion of mass, time)", "Patterns", "Persistent homology of patterns" (p1, p2) and "Simple nearest neighbor classifier".]

Dimensions of spheres.

1. This is a proof-of-concept experiment.

2. Suppose we are given a collection of point clouds sampled from the (round) sphere S^d for d ∈ {1, ..., 9}.

3. Suppose that the point clouds were normalized so that the average distance between points is 1.

4. Then, zero- and one-dimensional persistence identify the dimension (permutation test).
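Steps 2 and 3 can be sketched in Python. The sampling method (normalizing Gaussian vectors, a standard way to sample uniformly from a sphere) is an assumption, since the slides do not say how the clouds were generated; the function names, the sample size and the seed are illustrative.

```python
import math
import random
from itertools import combinations


def sample_sphere(d, n, rng):
    """n points sampled uniformly from the unit sphere S^d in R^(d+1)."""
    points = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d + 1)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append([c / norm for c in v])
    return points


def mean_pairwise_distance(points):
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def normalize(points):
    """Rescale the cloud so that the average pairwise distance is 1."""
    scale = mean_pairwise_distance(points)
    return [[c / scale for c in p] for p in points]


cloud = normalize(sample_sphere(2, 50, random.Random(0)))
print(round(mean_pairwise_distance(cloud), 6))  # 1.0
```

The normalized clouds would then be passed to a persistent homology program, and the resulting landscapes compared with the permutation test.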

Average landscapes in dimension 1.

Granular media.

1. Granular media are large conglomerations of discrete macroscopic particles.

2. They behave differently from solids, liquids, or gases.

3. Konstantin Mischaikow and Miro Kramar are using persistence to characterize force networks inside the media.

4. A lot of distances need to be computed.

More to come soon.

I hope...

How to obtain?

1. Go to http://www.math.upenn.edu/~dlotko/persistenceLandscape.html.

2. Linux, Windows and OS X executables which can perform most typical tasks are provided.

3. Source code (still a bit messy) for advanced users. A lot of comments are provided in the code.

4. The R-package TDA can be obtained from here: http://cran.r-project.org/web/packages/TDA/index.html.

5. I will be happy to do a code demonstration for you!

Joint work with:

Peter Bubenik, Takashi Ishihara, Michael Kerber, Konstantin Mischaikow, Thomas Wanner.

Thank you for your time!

Pawel Dlotko, Inria, Saclay, pawel.dlotko@inria.fr

pawel dlotko @ skype

pdlotko @ gmail

Let’s check out the library!

1. Dataset: let us sample 11 times 50n points from a wedge of n circles, iid with some error.

2. Compute the Rips complex and the persistence of each of the point clouds.


What do you need first?

1. You need persistence intervals in the form of a file:
1 2
4 5
9 22

2. They can be obtained with various programs to compute persistent homology.

3. Dionysus, JPlex, Perseus, PHAT, Plex.
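For experimenting outside the toolbox, such a file is easy to read. The sketch below assumes one interval per line, with birth and death separated by whitespace; the toolbox's own reader may accept other variants, and `parse_barcode` is a hypothetical name.

```python
def parse_barcode(text):
    """Parse lines of the form 'birth death' into a list of (birth, death)."""
    intervals = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        birth, death = float(fields[0]), float(fields[1])
        intervals.append((birth, death))
    return intervals


print(parse_barcode("1 2\n4 5\n9 22\n"))
# [(1.0, 2.0), (4.0, 5.0), (9.0, 22.0)]
```

For a file on disk, the same function can be applied to the file's contents, for example `parse_barcode(open(path).read())` with `path` pointing at the barcode file.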

Distance matrix

1. Go to http://www.math.upenn.edu/~dlotko/persistenceLandscape.html.

2. Linux, Windows and OS X executables which can perform most typical tasks are provided.

3. Source code (still a bit messy) for advanced users. A lot of comments are provided in the code.

4. Construct two files with paths to the barcodes.

5. Call DistanceMatrix program.

Others...

1. Let us try standard deviation (StandardDeviation),

2. Permutation test (PermutationTest),

3. Computations of averages (ComputeAverage),

4. Plotting subroutines (PlotsOfLandscapesViaScripts).

5. Classification (in dimension 1) (ClassifierBasedOnSingleDimension).