Topology Proteins Software Summary
Topological Data Analysis
Peter Bubenik
University of Florida, Department of Mathematics, [email protected]
http://people.clas.ufl.edu/peterbubenik/
British Applied Mathematics Colloquium, University of Oxford
April 6, 2016
Peter Bubenik Topological Data Analysis
Topological Data Analysis
Idea
Use topology to summarize and learn from the “shape” of data.
Simplicial complexes from point data
The Čech construction
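As a rough illustration of the construction: the Čech complex at radius r contains a simplex for every set of points whose closed balls of radius r share a common point, which is the same as asking that the smallest ball enclosing those points has radius at most r. The sketch below handles points in the plane and simplices up to triangles only; the function names `min_enclosing_radius` and `cech_complex` are made up for this example and do not come from the talk or from any library.

```python
from itertools import combinations
from math import dist


def min_enclosing_radius(points):
    """Radius of the smallest disc containing 1, 2 or 3 points in the plane."""
    if len(points) == 1:
        return 0.0
    if len(points) == 2:
        return dist(points[0], points[1]) / 2
    a, b, c = points
    la, lb, lc = dist(b, c), dist(a, c), dist(a, b)
    s = sorted([la, lb, lc])
    # Right, obtuse or degenerate triangle: the disc whose diameter is the
    # longest side already contains the third point.
    if s[0] ** 2 + s[1] ** 2 <= s[2] ** 2:
        return s[2] / 2
    # Acute triangle: the smallest enclosing disc is the circumscribed disc,
    # with radius (product of side lengths) / (4 * area).
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2
    return la * lb * lc / (4 * area)


def cech_complex(points, r, max_dim=2):
    """Simplices (as tuples of point indices) of the Cech complex at radius r."""
    simplices = []
    for size in range(1, max_dim + 2):  # size = number of vertices
        for idx in combinations(range(len(points)), size):
            if min_enclosing_radius([points[i] for i in idx]) <= r:
                simplices.append(idx)
    return simplices


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (0, 3)]  # a 3-4-5 right triangle
    print(cech_complex(pts, 2.0))   # vertices and the two short edges
    print(cech_complex(pts, 2.5))   # all edges and the triangle
```

Increasing r can only add simplices, which is what makes the family of complexes a filtration.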
Homology of simplicial complexes
Definition
Homology in degree k is given by the k-cycles modulo the k-boundaries.
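To make the definition concrete, here is a small sketch that computes the dimension of homology in each degree (the Betti numbers) from the ranks of the boundary maps, using dim H_k = (number of k-simplices) - rank(d_k) - rank(d_{k+1}). It assumes coefficients in the field with two elements and an input list of simplices that is closed under taking faces; `rank_gf2` and `betti_numbers` are names chosen for this example only.

```python
from itertools import combinations


def rank_gf2(rows):
    """Rank over the two-element field; each row is an integer bitmask."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit serves as the pivot column
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices):
    """Betti numbers with Z/2 coefficients of a face-closed list of simplices."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    ranks = {0: 0, top + 1: 0}  # d_0 and d_{top+1} are zero maps
    for k in range(1, top + 1):
        index = {s: i for i, s in enumerate(by_dim[k - 1])}
        rows = []
        for s in by_dim[k]:
            mask = 0
            for face in combinations(s, k):  # the faces of s have k vertices
                mask |= 1 << index[face]
            rows.append(mask)
        ranks[k] = rank_gf2(rows)
    # dim H_k = dim(k-cycles) - dim(k-boundaries)
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(top + 1)]


if __name__ == "__main__":
    hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
    print(betti_numbers(hollow))                # [1, 1]: one component, one loop
    print(betti_numbers(hollow + [(0, 1, 2)]))  # [1, 0, 0]: the loop is filled in
```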
Persistence
Main idea
Vary a parameter and keep track of when features appear and disappear.
(Figure, animated: radius = 0, 1, 2, ..., 11.)
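The idea of tracking features as the radius grows can be sketched in degree 0, where the features are connected components. The example below assumes balls of radius r around each point, so two points become connected once r reaches half their distance. Every component appears at radius 0, and each merge ends one component. The function name `h0_barcode` and the sample points are illustrative, not taken from the talk.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """(birth, death) radii of connected components as the ball radius grows."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two balls of radius r meet once r is at least half the distance.
    edges = sorted(
        (dist(points[i], points[j]) / 2, i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for r, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # two components merge: one of them dies at r
            parent[ri] = rj
            bars.append((0.0, r))
    bars.append((0.0, inf))        # the last component never disappears
    return bars


if __name__ == "__main__":
    print(h0_barcode([(0, 0), (2, 0), (6, 0)]))
    # [(0.0, 1.0), (0.0, 2.0), (0.0, inf)]
```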
Mathematical encoding
We have an increasing sequence of simplicial complexes
X_0 ⊆ X_1 ⊆ X_2 ⊆ ··· ⊆ X_m
called a filtered simplicial complex.
Apply homology.
We get a sequence of vector spaces and linear maps
V_0 → V_1 → V_2 → ··· → V_m
called a persistence module.
Persistence module to Barcode
V_0 → V_1 → V_2 → V_3 → V_4 → V_5 → V_6 → V_7 → ··· → V_m
Fundamental Theorem of Persistent Homology
There exists a choice of bases for the vector spaces V_i such that each map is determined by a bipartite matching of basis vectors.
Get a barcode:
(Figure: barcode over an axis with ticks from 2 to 12.)
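One common way to write the same statement is as a decomposition into interval modules. The formulation below assumes each V_i is a finite-dimensional vector space over a field; the symbols F, I[b, d], b_j, d_j and n are introduced here and do not appear on the slide.

```latex
V \;\cong\; \bigoplus_{j=1}^{n} I[b_j, d_j],
\qquad
I[b, d]_i =
\begin{cases}
  \mathbb{F}, & b \le i \le d, \\
  0, & \text{otherwise,}
\end{cases}
```

Here the maps of I[b, d] are the identity between consecutive copies of F and zero elsewhere. Each matched chain of basis vectors corresponds to one summand, and the barcode is the collection of intervals [b_j, d_j].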
Barcode to Persistence Landscape
Barcode:
(Figure: barcode over a horizontal axis from 0 to 14.)
Persistence Landscape:
(Figure: graphs of λ_1, λ_2, λ_3 over a horizontal axis up to 14 and a vertical axis up to 6; λ_k = 0 for k ≥ 4.)
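The passage from a barcode to its landscape can be sketched as follows. Each bar (b, d) contributes a tent function max(0, min(t - b, d - t)), and λ_k(t) is the k-th largest of these values at t. The function name `landscape` and the two-bar barcode used below are invented for illustration; they are not the barcode drawn on the slide.

```python
def landscape(barcode, k, t):
    """Value at t of the k-th persistence landscape function (k = 1, 2, ...)."""
    # One tent per bar: rises from the birth b, peaks at the midpoint,
    # falls back to zero at the death d.
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in barcode),
        reverse=True,
    )
    # lambda_k is identically zero once k exceeds the number of bars.
    return tents[k - 1] if k <= len(tents) else 0.0


if __name__ == "__main__":
    bars = [(0, 4), (2, 6)]  # hypothetical barcode
    for t in (2, 3, 4):
        print(t, [landscape(bars, k, t) for k in (1, 2, 3)])
```

Because the result is an ordinary real-valued function for each k, landscapes can be averaged and compared with the usual function-space distances.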
Maltose Binding Protein, two conformations
V. Kovacev-Nikolic, P. Bubenik, D. Nikolic, and G. Heo. Using persistent homology and dynamical distances to analyze protein binding. Statistical Applications in Genetics and Molecular Biology, 15 (2016) no. 1, 19–38.
Maltose Binding Protein Data
The Data
Fourteen MBP structures from the Protein Data Bank.
7 closed conformations
7 open conformations
X-ray crystallography: coordinates of atoms.
Represent each amino acid residue by its Cα atom.
Have 14 sets of 370 points in R^3.
The Goal
Can we use topological data analysis to distinguish the open and closed conformations?
Filtered simplicial complex
Average persistence landscapes
(Figures: two panels, H1 closed and H1 open.)
Clustering of protein conformations
Figure 1: Distance analysis based on the 2-Landscape distance shows a separation between the closed (blue) and the open (red) MBP conformation for degree 0 (left) and degree 1 (right) persistent homology. Similar results hold for degree 2. The projection of the 14 × 14 distance matrix onto the 3D space is attained via Isomap.
4.3 Statistical Inference
To measure the statistical significance of visually observed differences between the closed and the open conformation we use a permutation test. For each degree, we calculate fourteen sample values of the random variable X from Equation (3). The permutation test carried out at the significance level of 0.05 yields a p-value of 5.83 × 10^-4 for both homology in degree 0 and in degree 1. We obtain the same p-value since in both cases the observed statistic was the most extreme among all 1716 possible permutations.
Projection of the L^2 distance matrix to R^3 using Isomap.
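The two numbers quoted in the excerpt fit together arithmetically. One reading consistent with them, which is an assumption since the excerpt does not spell out the counting, is that the 14 structures are split into two unlabeled groups of 7, giving C(14, 7) / 2 = 1716 splits, and that the observed split is the single most extreme one, giving a p-value of 1/1716.

```python
from math import comb

# Assumption: splits of 14 structures into two unlabeled groups of 7.
n_splits = comb(14, 7) // 2   # 3432 // 2
p_value = 1 / n_splits        # observed split is the single most extreme one

if __name__ == "__main__":
    print(n_splits)           # 1716
    print(f"{p_value:.2e}")   # 5.83e-04
```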
Classification of protein conformations
(Figure: three panels, H0, H1 and H2, each with axes x, y, z. Legend: closed, open, support vector.)
SVM on Isomap embedded 3D coordinates in the metric space induced by the 2-Landscape distance
Software
Persistent Homology software:
JavaPlex
PHAT, DIPHA
Perseus
Dionysus
CHOMP
GUDHI
Persistence Landscape software:
The Persistence Landscape Toolbox
the R package TDA
Topological Data Analysis Summary
Raw data → (Preprocess) → Clean data → (Transform) → Filtered simplicial complex → (Homology) → Persistence module → Topological summary → Statistics and Machine Learning