Storytelling and Clustering for Cellular Signaling Pathways
M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061.
Objective
• STKE Dataset
  • Cell interactions through chemical signals
• Discover relationships between the pathways
  • Graph structure
    • Subgraph discovery problem
  • Pathway relationships
    • Clustering
    • Storytelling
[Figure: Myocyte Adrenergic Pathway (CMP_9043)]
Dataset properties
Total Pathways = 50
[Figure: histogram of the number of pathways in each size range; x-axis: size range in bins of 10, from 1-10 through 100-110; y-axis: number of pathways in size range, 0 to 12]
Design Pipeline
[Diagram: pipeline with the components STKE Dataset, Preprocessor, Pathway Graphs, Frequent Subgraph Discovery, Frequent Subgraphs, Clustering, and NN Storytelling]
Subsequent Candidate Generation
• Apriori: incremental approach [17]
• FSG [2]
• Generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs that share a common core subgraph of (k-1) edges.
• The cost of comparing subgraphs (and core subgraphs) is reduced by using a hash code for each subgraph object.
[Diagram: a subgraph on nodes l, m, n, o, p and a subgraph on nodes m, n, o, p, q are combined into a candidate on nodes l, m, n, o, p, q]
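The join step can be sketched as follows. This is a minimal illustration, not the FSG implementation from [2]: it assumes every node label is unique, so a subgraph can be stored as a frozenset of undirected edges, and the names `edge`, `nodes`, and `join_candidates` are hypothetical. Python's built-in hashing of frozensets stands in for the per-subgraph hash code.

```python
from itertools import combinations


def edge(u, v):
    """An undirected edge between two uniquely labeled nodes (assumption)."""
    return frozenset((u, v))


def nodes(subgraph):
    """All node labels that appear in a subgraph (a frozenset of edges)."""
    return set().union(*subgraph)


def join_candidates(k_subgraphs):
    """Combine pairs of k-edge subgraphs that share a (k-1)-edge core.

    Assumption: each subgraph is a connected frozenset of edges.
    Returns the set of distinct (k+1)-edge candidates.
    """
    candidates = set()
    for a, b in combinations(k_subgraphs, 2):
        core = a & b
        # Both must have k edges and differ in exactly one edge each.
        if len(a) != len(b) or len(core) != len(a) - 1:
            continue
        # For k = 1 the core is empty, so require a shared node explicitly.
        if not nodes(a) & nodes(b):
            continue
        candidates.add(a | b)  # duplicate candidates collapse via hashing
    return candidates


g1 = frozenset({edge("m", "n"), edge("n", "o"), edge("o", "l")})
g2 = frozenset({edge("m", "n"), edge("n", "o"), edge("o", "p")})
g3 = frozenset({edge("x", "y"), edge("y", "z"), edge("z", "w")})
result = join_candidates([g1, g2, g3])
```

Here `g1` and `g2` share the two-edge core m-n, n-o and yield one 4-edge candidate, while `g3` shares no core with either and produces nothing. Every pair still has to be examined, which is the comparison cost this approach carries.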
Subsequent Candidate Generation
Instance:
• Number of 5-edge subgraphs: 21
• Core subgraph comparisons for s1: 20
[Diagram: s1 is compared with each of the other 5-edge subgraphs over nodes l through t and z; for pairs without a common core subgraph the candidate is marked "Not generated"]
Master Pathway Graph (MPG)
• Total unique nodes: 1205
• Total relations: 1376
SEG - Subgraph Extension Generation
• Neighborhood extension
• Neighborhood list: {q, r, s}
• Comparison is not required. The subgraph is extended from physical evidence.
[Diagram: the subgraph on nodes l, m, n, o, p is extended with each of the neighbors q, r, and s in turn, giving three extended subgraphs]
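A minimal sketch of neighborhood extension, assuming that "physical evidence" means edges that actually exist in the Master Pathway Graph and that node labels are unique. The helper names (`edge`, `nodes`, `neighborhood`, `extend`) and the toy edge list are hypothetical, not taken from the slides.

```python
def edge(u, v):
    """An undirected edge between two uniquely labeled nodes (assumption)."""
    return frozenset((u, v))


def nodes(subgraph):
    """All node labels that appear in a subgraph (a frozenset of edges)."""
    return set().union(*subgraph)


def neighborhood(subgraph, mpg_edges):
    """Nodes outside the subgraph that some MPG edge attaches to it."""
    inside = nodes(subgraph)
    return {n for e in mpg_edges if e & inside for n in e - inside}


def extend(subgraph, mpg_edges):
    """Yield each (k+1)-edge extension that is backed by an existing MPG edge.

    No pairwise subgraph comparison is needed: every extension adds one
    edge that touches the subgraph and is not already part of it.
    """
    inside = nodes(subgraph)
    for e in mpg_edges:
        if e not in subgraph and e & inside:
            yield subgraph | {e}


# Toy MPG: the subgraph l, m, n, o, p plus three attached neighbors.
mpg = [edge("m", "n"), edge("n", "o"), edge("o", "l"), edge("o", "p"),
       edge("p", "q"), edge("o", "r"), edge("l", "s"), edge("x", "y")]
sub = frozenset(mpg[:4])
neighbors = neighborhood(sub, mpg)
extensions = list(extend(sub, mpg))
```

With this toy graph the neighborhood is {q, r, s} and three 5-edge extensions are produced. Note that `extend` as written would also add an MPG edge that joins two nodes already inside the subgraph; whether SEG does the same is not stated on the slide.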
Subgraph Discovery
min_sup = 2%
k     # of subgraphs generated   Time (sec.)
1     1,376                      Existing
2     5,380                      41
3     29,565                     149
4     187,508                    971
5     1,274,852                  7,518
...   ...                        ...
• What is so novel about pruning edges?
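A minimal sketch of the support test behind min_sup, assuming the usual definition from frequent subgraph mining: the support of a subgraph is the fraction of pathways that contain all of its edges. Under that reading, min_sup = 2% of 50 pathways means a subgraph only has to occur in a single pathway, which is consistent with the rapid growth in the table. The function names and toy pathways are hypothetical.

```python
def support(subgraph, pathways):
    """Fraction of pathways whose edge set contains every edge of subgraph."""
    hits = sum(1 for p in pathways if subgraph <= p)
    return hits / len(pathways)


def frequent(subgraphs, pathways, min_sup):
    """Keep only the subgraphs whose support reaches min_sup (a fraction)."""
    return [s for s in subgraphs if support(s, pathways) >= min_sup]


# Toy data: each pathway is a set of (source, target) edges.
pathways = [
    {("a", "b"), ("b", "c")},
    {("a", "b")},
    {("c", "d")},
    {("a", "b"), ("c", "d")},
]
s1 = frozenset({("a", "b")})
s2 = frozenset({("a", "b"), ("b", "c")})
kept = frequent([s1, s2], pathways, min_sup=0.5)
```

Here `s1` occurs in three of four pathways and is kept, while `s2` occurs in only one and is pruned.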
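‘Importance Factor’ of a subgraph: sfipf
For the i-th subgraph and the j-th pathway:
Subgraph frequency:
$$sf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
Inverse pathway frequency:
$$ipf_i = \log \frac{|D|}{|\{p_j : s_i \in p_j\}|}$$
$$sfipf_{i,j} = sf_{i,j} \times ipf_i$$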
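A minimal sketch of how such weights could be computed, by analogy with tf-idf. It assumes that n_ij is the number of occurrences of subgraph i in pathway j, that |D| is the total number of pathways, and that the logarithm is natural; none of these details is spelled out on the slide, and the function name `sfipf` and the toy counts are hypothetical.

```python
import math


def sfipf(counts):
    """Compute sf-ipf weights.

    counts[j][i] is n_ij, the (assumed) occurrence count of subgraph i
    in pathway j. Returns weights[j][i] = sf_ij * ipf_i.
    """
    num_pathways = len(counts)

    # Pathway frequency: in how many pathways does each subgraph occur?
    pathway_freq = {}
    for row in counts.values():
        for i, n in row.items():
            if n > 0:
                pathway_freq[i] = pathway_freq.get(i, 0) + 1

    weights = {}
    for j, row in counts.items():
        total = sum(row.values())  # sum over k of n_kj
        weights[j] = {
            i: (n / total) * math.log(num_pathways / pathway_freq[i])
            for i, n in row.items()
            if n > 0
        }
    return weights


counts = {
    "p1": {"s1": 1, "s2": 1},
    "p2": {"s1": 1},
}
weights = sfipf(counts)
```

In this toy input `s1` occurs in every pathway, so its ipf and its weight are zero, while `s2` is specific to `p1` and gets a positive weight. A threshold such as min_sfipf would then drop the low-weight entries.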
Dataset Properties (sfipf)
[Figure: number of edges left vs. min_sfipf; x-axis: min_sfipf, 0.00 to 0.20; y-axis: number of edges left, 0 to 1400]
[Figure: number of pathways left vs. min_sfipf; x-axis: min_sfipf, 0.00 to 0.20; y-axis: number of pathways left, 0 to 50]
Number of edges in MPG = 1376
Total pathways = 50
Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: time (ms) vs. k for FSG and SEG; x-axis: k, 3 to 21; y-axis: 0 to 400x10^3 ms]
[Figure: time (ms) vs. k for FSG and SEG; x-axis: k, 3 to 21; y-axis: 0 to 3000 ms]
[Figure: number of attempts vs. k for FSG and SEG; x-axis: k, 3 to 21; y-axis: 0 to 1,250,000]
k     Number of Subgraphs   Time Saved (%)   Attempts Saved (%)
2     186                   99.83            98.98
3     246                   98.33            86.15
4     305                   98.57            86.38
5     323                   98.95            86.91
6     313                   98.96            85.64
7     279                   98.88            83.25
8     263                   98.67            78.91
9     292                   98.38            74.76
10    364                   98.58            74.75
11    470                   98.76            78.08
12    608                   99.04            81.84
13    785                   99.22            85.02
14    980                   99.38            87.63
15    1117                  99.48            89.48
16    1075                  99.53            90.26
17    804                   99.51            89.40
18    430                   99.34            85.22
19    141                   98.76            71.22
20    20                    96.15            9.19
21    1                     75.74            -574.47
Overall attempts saved = 89.52%
Overall time saved = 99.39%
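The slides do not say how the "saved" percentages are defined. A natural reading, assumed here, is the relative reduction of SEG with respect to FSG. The function name `percent_saved` and the input numbers below are illustrative only, not measurements from the study.

```python
def percent_saved(fsg_cost, seg_cost):
    """Relative saving of SEG over FSG in percent (assumed definition)."""
    return 100.0 * (fsg_cost - seg_cost) / fsg_cost


# Illustrative inputs: SEG needs far fewer attempts than FSG ...
typical = percent_saved(1000, 6)
# ... and a case where SEG needs more attempts than FSG.
negative = percent_saved(100, 674.47)
```

Under this definition a negative entry such as -574.47 at k = 21 would mean that SEG made more attempts than FSG at that level, roughly 6.7 times as many, even though it still took less time.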
Clustering
• Hierarchical Agglomerative Clustering (HAC)
• k-means
• Unsupervised measure of clusters’ validity
  • Average Silhouette Coefficient (ASC) [19]
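A minimal sketch of the Average Silhouette Coefficient using the standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b). The function name, the one-dimensional toy points, and the absolute-difference distance are assumptions made for illustration.

```python
def average_silhouette(points, labels, dist):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette taken as 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
distance = lambda x, y: abs(x - y)
good = average_silhouette(points, [0, 0, 1, 1], distance)
bad = average_silhouette(points, [0, 1, 0, 1], distance)
```

The well-separated labeling scores close to 1 and the interleaved labeling scores below 0, which is why a higher ASC is read as a better clustering.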
Clustering
min_sup = 4%, min_sfipf = 0.01
[Figure: k-means, ASC (0.0 to 0.4) vs. number of clusters (2 to 20) for the measures Cosine sfipf, Dice, Jaccard, and Overlap]
[Figure: HAC, ASC (0.0 to 0.4) vs. number of clusters (2 to 20)]
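The measures named in the plot legend can be sketched with their standard textbook definitions. This assumes that Dice, Jaccard, and Overlap compare the sets of frequent subgraphs found in two pathways and that cosine compares their sfipf weight vectors; the slides give only the names, so the representation and the function names are assumptions.

```python
import math


def jaccard(a, b):
    """|A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 |A intersect B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a, b):
    """|A intersect B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(u, v):
    """Cosine of two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


a = {"s1", "s2", "s3"}
b = {"s2", "s3", "s4"}
u = {"s1": 1.0, "s2": 1.0}
v = {"s2": 1.0, "s3": 1.0}
```

For these toy inputs Jaccard is 0.5, Dice is about 0.67, Overlap is about 0.67, and cosine is 0.5.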
Clustering
[Figure: ASC contour map for 10 clusters using HAC; x-axis: min_sup, 4 to 12; y-axis: min_sfipf, 0.01 to 0.05; contour levels 0.08 to 0.20]
[Figure: ASC contour map for 10 clusters using k-means; x-axis: min_sup, 4 to 12; y-axis: min_sfipf, 0.01 to 0.05; contour levels 0.04 to 0.14]
Pathway Relations (StoryTelling)
• Bidirectional search
• Cover tree for NN
[Diagram: bidirectional search between S and T through intermediate nodes p1, p2, p3 and p7, p8, p9]
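A minimal sketch of bidirectional search over nearest neighbors with branching factor b. Several pieces are assumptions made for illustration: items are described by feature sets, similarity is Jaccard, and a brute-force sort replaces the cover tree used for nearest-neighbor lookup. All function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbors(item, features, b):
    """Top-b most similar other items (brute force, not a cover tree)."""
    others = [x for x in features if x != item]
    others.sort(key=lambda x: (-jaccard(features[item], features[x]), x))
    return others[:b]


def expand(frontier, own_parent, other_parent, features, b):
    """Grow one side by one level; return a meeting node or None."""
    next_frontier = []
    for node in frontier:
        for nb in nearest_neighbors(node, features, b):
            if nb in own_parent:
                continue
            own_parent[nb] = node
            if nb in other_parent:
                return nb  # the two searches have met
            next_frontier.append(nb)
    frontier[:] = next_frontier
    return None


def trace(node, parent):
    """Walk parent links from node back to the root of its search."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def find_story(start, target, features, b):
    """Chain of items from start to target, or None if none is found.

    Note: top-b neighborhood is not symmetric, so hops found by the
    backward search are neighbors in the target-to-start direction.
    """
    if start == target:
        return [start]
    fwd_parent, bwd_parent = {start: None}, {target: None}
    fwd_frontier, bwd_frontier = [start], [target]
    while fwd_frontier and bwd_frontier:
        for frontier, own, other in (
            (fwd_frontier, fwd_parent, bwd_parent),
            (bwd_frontier, bwd_parent, fwd_parent),
        ):
            meet = expand(frontier, own, other, features, b)
            if meet is not None:
                return trace(meet, fwd_parent)[::-1] + trace(meet, bwd_parent)[1:]
    return None


features = {
    "A": {1, 2, 3}, "B": {2, 3, 4}, "C": {3, 4, 5},
    "D": {4, 5, 6}, "E": {5, 6, 7},
}
story = find_story("A", "E", features, b=2)
```

With b = 2 the forward search reaches B and C from A, the backward search reaches D and C from E, and the two meet at C. A larger branching factor explores more neighbors per step, which is the trade-off the later plots examine.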
Day-to-day life example
From: Roman Holiday To: Terminator 3
• Roman Holiday, Sabrina, Breakfast at Tiffany’s, Some Like it Hot, Rear Window, 2001: A Space Odyssey, Golden Eye, Die Another Day, Terminator 3
From Terminator 3
• Terminator 3, Collateral damage, Lethal Weapon 4, Die Hard 2, Speed, Air Force One, U.S. Marshals, S.W.A.T., The day after Tomorrow, van Helsing, Blade: Trinity
From Roman Holiday
• Roman Holiday, Sabrina, Funny Face, Deep in my Heart, Singing in the rain, An American in Paris, Kismet, Kiss me Kate, High Society, Anchors Aweigh, On the Town, Take me out to the Ball Game
Examples in STKE
http://people.cs.vt.edu/msh/infoviz/3/
Pathway Relations (StoryTelling)
Number of stories of varying length for different branching factors
[Figure: number of t-length stories (0 to 350) vs. story length t (3 to 16) for b = 2, 4, 6, 8]
[Figure: number of t-length stories (0 to 350) vs. story length t (3 to 16) for b = 2 through 10]
Pathway Relations (StoryTelling)
[Figure: total stories from all pairs (0 to 1000) vs. branching factor b (2 to 10)]
[Figure: time to generate all stories in ms (0 to 1.4x10^6) vs. branching factor b (2 to 10)]
[Figure: length of the longest story (4 to 16) vs. branching factor b (2 to 10)]
Future Directions
• Compare our SEG graph methods with text-based clustering and storytelling
• Examine the costs and benefits of combining text and graph mining techniques
References
[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of Cell Signaling", http://stke.sciencemag.org/cm/
[2] Kuramochi, M. and Karypis, G., "An efficient algorithm for discovering frequent subgraphs", IEEE Transactions on KDE, Vol. 16(9), September 2004, pp. 1038-1051.
[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of individual tumor samples", BMC Bioinformatics, June 29, 2005.
[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE Transactions on KDE, Vol. 20(6), June 2008, pp. 736-751.
[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for Signaling Pathways", 18th SSDBM, 2006, pp. 133-142.
[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC. 2007, Vol. 2, pp. 319 - 323.
[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003, pp. 105 - 110.
[8] Abello, J., van Ham, F., and Krishnan, N., "ASKGraphView: A Large Scale Graph Visualization System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006, pp. 669 - 676.
[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H. "A clustering method for comparative analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced Applications, March 2003 pp. 327 - 334.
[10] Yan, X., and Han, J. "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002, pp. 721-724.
[11] Cohen, M., and Gudes, E. "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2004, pp. 51-58.
[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J. "Subdue: Compression-based Frequent Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005, pp. 71-76.
[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996, pp. 103-114.
[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with Background Knowledge", ICML 2001, pp. 577-584.
[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006, pp. 551-568.
[16] Beygelzimer, A., Kakade, S., Langford, J., "Cover trees for nearest neighbor", ICML 2006, pp. 97-104.
[17] Agrawal, R., and Srikant, R. "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994, pp. 487-499.
[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A. and Bollinger, T. "The Quest Data Mining System", KDD'96, USA, 1996, pp. 244-249.
[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, ISBN: 0321321367, April 2005, pp. 539-547.
[20] http://people.cs.vt.edu/amonika/infoviz/
Thank You