Post on 18-Jan-2018
A validation method for fuzzy clustering
A biological problem of gene expression data
Thanh Le, Katheleen J. Gardiner
University of Colorado Denver
July 18th, 2011
Overview
• Introduction
  • Data clustering: approaches and current challenges
• fzBLE
  • A novel method for validation of clustering results
• Datasets
  • Artificial and real datasets for testing fzBLE
• Experimental results
• Discussion
  • Advantages and limitations of fzBLE
Clustering problem
• Genes are clustered based on
  • Similarity
  • Dissimilarity
• Clusters are described by
  • Boundaries & overlaps
  • Number of clusters
  • Compactness within clusters
  • Separation between clusters
Clustering approaches
• Hierarchical approach
• Partitioning approach
  • Hard clustering approach
    • Crisp cluster boundaries
    • Crisp cluster membership
  • Soft/Fuzzy clustering approach
    • Overlapping cluster boundaries
    • Soft/Fuzzy membership
    • Appropriate for many real-world problems
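Fuzzy C-Means algorithm
• The model:

$$J(X \mid U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m} \, \lVert x_i - v_k \rVert^{2} \to \min, \qquad m > 1,$$

$$\sum_{k=1}^{c} u_{ki} = 1, \qquad i = 1..n.$$

• Features: fuzzy membership and soft cluster boundaries; one gene can belong to multiple clusters and be assigned to multiple biological processes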
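The FCM objective can be evaluated directly for a given partition. The following is a minimal sketch in plain Python, assuming the standard form J(X | U, V) = sum over i and k of u_ki^m * ||x_i - v_k||^2 with the constraint that each point's memberships sum to 1. The function names and the toy data are illustrative assumptions, not taken from the slides.

```python
def fcm_objective(X, U, V, m=2.0):
    """Evaluate J(X | U, V) = sum_i sum_k u_ki^m * ||x_i - v_k||^2.

    X: list of n points, V: list of c centers (each a list of floats),
    U: c x n membership matrix with U[k][i] = u_ki.
    """
    total = 0.0
    for k, v in enumerate(V):
        for i, x in enumerate(X):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
            total += (U[k][i] ** m) * sq_dist
    return total


def memberships_are_valid(U, tol=1e-9):
    """Check the constraint sum_k u_ki = 1 for every point i."""
    n = len(U[0])
    return all(abs(sum(row[i] for row in U) - 1.0) <= tol for i in range(n))


# Toy data (made up for illustration): three 1-D points, two centers.
X = [[0.0], [1.0], [4.0]]
V = [[0.5], [4.0]]
U = [[0.9, 0.8, 0.0],
     [0.1, 0.2, 1.0]]

J = fcm_objective(X, U, V, m=2.0)
print(J, memberships_are_valid(U))
```

A lower J means the partition is more compact around its centers for the chosen fuzzifier m.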
Fuzzy C-Means (contd.)
• Possibility-based model
• Model parameters estimated using an iterative process
• Rapid convergence
• Most appropriate for gene expression data
• Challenges:
  • Determining the number of clusters
  • Avoiding local optima
  • Choosing a goodness-of-fit measure to validate clustering results
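The iterative estimation mentioned above can be sketched with the textbook alternating updates for FCM: centers are membership-weighted means, and memberships are inverse-distance ratios. This is a minimal plain-Python sketch under that assumption. The function name, default parameters, and toy data are illustrative, and the slides do not say which initialisation or stopping rule the authors used.

```python
import random


def sq_dist(x, v):
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Textbook alternating FCM updates; returns (U, V) with U[k][i] = u_ki."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Random initial memberships, normalised so each point's column sums to 1.
    U = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(c)]
    for i in range(n):
        s = sum(U[k][i] for k in range(c))
        for k in range(c):
            U[k][i] /= s
    V = []
    for _ in range(max_iter):
        # Center update: v_k is the mean of the points weighted by u_ki^m.
        V = []
        for k in range(c):
            w = [U[k][i] ** m for i in range(n)]
            w_sum = sum(w)
            V.append([sum(w[i] * X[i][d] for i in range(n)) / w_sum
                      for d in range(dim)])
        # Membership update: u_ki proportional to (d_ki^2)^(-1/(m-1)).
        change = 0.0
        for i in range(n):
            d2 = [sq_dist(X[i], V[k]) for k in range(c)]
            zeros = [k for k in range(c) if d2[k] == 0.0]
            if zeros:  # point coincides with one or more centers
                new = [1.0 / len(zeros) if k in zeros else 0.0 for k in range(c)]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                inv_sum = sum(inv)
                new = [w / inv_sum for w in inv]
            for k in range(c):
                change = max(change, abs(new[k] - U[k][i]))
                U[k][i] = new[k]
        if change < tol:  # stop once memberships no longer move
            break
    return U, V


# Toy data (made up): two well-separated 1-D groups.
X = [[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]]
U, V = fuzzy_c_means(X, c=2)
centers = sorted(v[0] for v in V)
print(centers)
```

Because the result depends on the random start, a common way to reduce the risk of a poor local optimum is to rerun with several seeds and keep the run with the lowest objective value.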
Methods for fuzzy clustering validation
• Methods based on compactness and separation
  • Problems:
    • Over-fitting: the larger the number of clusters, the better the validity index.
    • No rationale for how to scale the two factors in the model.
• Methods based on goodness of fit
  • Statistical approach: Expectation-Maximization (EM) method
  • Problems:
    • Slow convergence, particularly at cluster boundaries, because of the exponential function.
    • Inappropriate for real datasets because the model assumes a specific data distribution (Gaussian, chi-squared, …).
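To make the index-based family concrete, here is a minimal sketch of three indices that appear in the results tables: the partition coefficient (PC), the partition entropy (PE), and the Xie and Beni index (XB). The formulas are the commonly cited textbook definitions, not something stated on the slides, and the toy partition is made up.

```python
import math


def partition_coefficient(U):
    """PC = (1/n) * sum_k sum_i u_ki^2; equals 1 for a crisp partition."""
    n = len(U[0])
    return sum(u * u for row in U for u in row) / n


def partition_entropy(U):
    """PE = -(1/n) * sum_k sum_i u_ki * ln(u_ki); equals 0 for a crisp partition."""
    n = len(U[0])
    return -sum(u * math.log(u) for row in U for u in row if u > 0.0) / n


def xie_beni(X, U, V):
    """XB = sum_k sum_i u_ki^2 ||x_i - v_k||^2 / (n * min_{j != k} ||v_j - v_k||^2)."""
    n = len(X)
    compactness = sum(
        (U[k][i] ** 2) * sum((a - b) ** 2 for a, b in zip(X[i], V[k]))
        for k in range(len(V)) for i in range(n)
    )
    separation = min(
        sum((a - b) ** 2 for a, b in zip(V[j], V[k]))
        for j in range(len(V)) for k in range(len(V)) if j != k
    )
    return compactness / (n * separation)


# Toy partition (made up): three 1-D points, two centers.
X = [[0.0], [1.0], [4.0]]
V = [[0.5], [4.0]]
U = [[0.9, 0.8, 0.0],
     [0.1, 0.2, 1.0]]

pc = partition_coefficient(U)
pe = partition_entropy(U)
xb = xie_beni(X, U, V)
print(pc, pe, xb)
```

XB divides a compactness term by a separation term, which is the two-factor structure whose scaling the slide questions.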
The fzBLE method for cluster validation
1. Cluster using the Fuzzy C-Means clustering algorithm
2. Validate using the goodness-of-fit (the log-likelihood estimator) and a Bayesian approach
Cluster validation: Goodness-of-fit & fuzzy clustering
1. Convert the possibility model into a probability model
2. Use a Bayesian approach to compute the statistics
3. Apply the Central Limit Theorem to effectively represent the data distribution
4. Model selection based on goodness-of-fit
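The slides do not give the fzBLE formulas, so the following is not the fzBLE estimator. It is only a rough sketch of the general idea of scoring a fuzzy partition by a log-likelihood. The assumptions are all mine: memberships are treated as probabilities, cluster priors are average memberships, each cluster is summarised by a membership-weighted 1-D Gaussian, and every cluster has positive total membership.

```python
import math


def fuzzy_log_likelihood(x, U):
    """Log-likelihood of 1-D data x under a mixture built from memberships U.

    U[k][i] = u_ki, with each point's memberships summing to 1.
    Illustrative assumption only: Gaussian clusters with weighted moments.
    """
    n = len(x)
    components = []
    for row in U:
        weight = sum(row)                      # total membership of the cluster
        mean = sum(u * xi for u, xi in zip(row, x)) / weight
        var = sum(u * (xi - mean) ** 2 for u, xi in zip(row, x)) / weight
        components.append((weight / n, mean, max(var, 1e-12)))
    log_lik = 0.0
    for xi in x:
        mix = sum(
            p * math.exp(-(xi - mu) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for p, mu, v in components
        )
        log_lik += math.log(max(mix, 1e-300))  # guard against log(0)
    return log_lik


# Toy data (made up): two well-separated groups.
x = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1]
U_one = [[1.0] * 6]                            # everything in one cluster
U_two = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]       # the two visible groups

ll_one = fuzzy_log_likelihood(x, U_one)
ll_two = fuzzy_log_likelihood(x, U_two)
print(ll_one, ll_two)
```

On this toy data the two-cluster partition scores higher than the single-cluster one, which is the kind of comparison a goodness-of-fit criterion uses for model selection.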
Datasets
• Artificial datasets
  • Finite mixture model based datasets
• Real datasets
  • Iris, Wine and Glass datasets at the UC Irvine Machine Learning Repository
  • Gene datasets, which are more complex
    • Yeast cell cycle gene expression (Yeast)
    • Yeast gene functional annotations (Yeast-MIPS)
    • Rat Central Nervous System (RCNS) gene expression
Experimental results on artificial datasets
# clusters fzBLE PC PE FS XB CWB PBMF BR CF
3 1.00 0.42 0.42 0.42 0.42 1.00 1.00 0.83 0.00
4 1.00 0.92 0.92 0.92 0.83 1.00 1.00 1.00 0.00
5 1.00 0.75 0.75 0.83 0.75 0.83 1.00 1.00 0.00
6 1.00 0.92 0.83 0.92 0.58 0.58 1.00 0.92 0.00
7 1.00 0.83 0.83 0.83 0.67 0.58 1.00 0.67 0.00
8 1.00 1.00 0.92 1.00 0.92 0.67 1.00 0.83 0.00
9 1.00 0.92 0.67 0.92 0.67 0.33 1.00 0.83 0.00
PC-partition coefficient, PE-partition entropy, FS-Fukuyama-Sugeno, XB-Xie and Beni, CWB-Compose Within and Between scattering, PBMF-Pakhira, Bandyopadhyay and Maulik Fuzzy, BR-Rezaee B., CF-Compactness factor; loop=5, #cluster range=[2,12]
Correctness Ratios in determining the number of clusters
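A correctness ratio of this kind can be read as the fraction of test datasets on which an index recovers the true number of clusters. The sketch below assumes that reading. The slide does not state how many datasets were generated per cluster count, so the 12 predictions are hypothetical values chosen only for illustration.

```python
def correctness_ratio(predicted, true_c):
    """Fraction of datasets for which the predicted cluster count equals true_c."""
    return sum(1 for p in predicted if p == true_c) / len(predicted)


# Hypothetical predictions of one index on 12 datasets generated with 3 clusters.
predicted = [3, 3, 2, 3, 4, 3, 2, 3, 2, 2, 2, 2]
ratio = round(correctness_ratio(predicted, true_c=3), 2)
print(ratio)  # 0.42
```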
Experimental results on Glass dataset
# clusters  fzBLE       PC      PE      FS       XB      CWB         PBMF    BR      CF
2           -1135.6886  0.8884  0.1776   0.3700  0.7222   6538.9311  0.3732  1.9817  0.5782
3           -1127.6854  0.8386  0.2747   0.1081  0.7817   4410.3006  0.4821  1.5004  0.4150
4           -1119.2457  0.8625  0.2515  -0.0630  0.6917   3266.5876  0.4463  1.0455  0.3354
5           -1123.2826  0.8577  0.2698  -0.1978  0.6450   2878.8912  0.4610  0.8380  0.2818
6           -1113.8339  0.8004  0.3865  -0.2050  1.4944   5001.1752  0.3400  0.8371  0.2430
7           -1116.5724  0.8183  0.3650  -0.2834  1.3802   5109.6082  0.3891  0.6914  0.2214
8           -1127.2626  0.8190  0.3637  -0.3948  1.4904   7172.2250  0.6065  0.5916  0.2108
9           -1117.7484  0.8119  0.3925  -0.3583  1.7503   8148.7667  0.3225  0.5634  0.1887
10          -1122.1585  0.8161  0.3852  -0.4214  1.7821   9439.3785  0.3909  0.4926  0.1758
11          -1121.9848  0.8259  0.3689  -0.4305  1.6260   9826.4211  0.3265  0.4470  0.1704
12          -1135.0453  0.8325  0.3555  -0.5183  1.4213  11318.4879  0.5317  0.3949  0.1591
13          -1138.9462  0.8317  0.3556  -0.5816  1.4918  14316.7592  0.6243  0.3544  0.1472
Algorithm Cluster Validity Scores and Decisions (highlighted in yellow)
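The yellow highlighting of each algorithm's decision does not survive in this transcript. As a small sketch, assuming fzBLE selects the cluster count with the largest log-likelihood score, the decision can be recomputed from the fzBLE column of the Glass table. The selection rule and the function name are assumptions, while the scores are the table values.

```python
# fzBLE column of the Glass table, keyed by the number of clusters.
glass_fzble = {
    2: -1135.6886, 3: -1127.6854, 4: -1119.2457, 5: -1123.2826,
    6: -1113.8339, 7: -1116.5724, 8: -1127.2626, 9: -1117.7484,
    10: -1122.1585, 11: -1121.9848, 12: -1135.0453, 13: -1138.9462,
}


def select_by_max_score(scores):
    """Return the cluster count whose score is largest (assumed decision rule)."""
    return max(scores, key=scores.get)


best_c = select_by_max_score(glass_fzble)
print(best_c)  # 6
```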
Experimental results on RCNS - more complex dataset; two-factor scaling issue
# clusters  fzBLE      PC      PE      FS         XB      CWB      PBMF    BR      CF
2           -580.0728  0.9942  0.0121  -568.7972  0.0594   5.5107  4.2087  1.1107  177.8094
3           -564.1986  0.9430  0.0942  -487.6104  0.4877   4.1309  4.2839  1.6634  117.9632
4           -561.0169  0.9142  0.1470  -430.4863  0.9245   6.1224  3.3723  1.3184   99.1409
5           -561.7420  0.8900  0.1941  -397.0935  1.3006   9.4770  2.6071  1.1669   88.5963
6           -552.9153  0.8695  0.2387  -300.6564  2.5231  20.6496  1.9499  1.1026   84.0905
7           -556.2905  0.8707  0.2386  -468.3121  2.1422  21.0187  2.8692  0.7875   57.5159
8           -555.3507  0.8925  0.2078  -462.0673  1.7245  20.0113  2.5323  0.5894   52.0348
9           -558.8686  0.8863  0.2192  -512.4278  1.6208  22.4772  2.6041  0.5019   45.9214
10          -565.8360  0.8847  0.2241  -644.1451  1.1897  21.9932  3.4949  0.3918   33.1378
Algorithm Cluster Validity Scores and Decisions (highlighted in yellow)
• 112 genes during RCNS development at 9 time points
• 6 clusters, 4 of which are functionally annotated (Somogyi et al. 1995, Wen et al. 1998)
Discussion: The advantages of fzBLE
• Performs better than other approaches on 3 levels of data.
• Compared with compactness-separation approaches
  • Solves the over-fit problem using goodness-of-fit.
  • Eliminates the need for two scaling factors
• Compared with the mixture model with EM approach
  • Rapid convergence
  • No assumption on data distribution
Discussion: The limitations of fzBLE
• Depends on internal validity
• External validities are needed
  • Biological validity: GO terms, pathways, PPI
• Future work on gene expression:
  • Distance definition based on biological context
  • Combine fzBLE with biological homology and stability indices
Thank you!
Questions?
We acknowledge the support from
• National Institutes of Health
• Linda Crnic Institute
• Vietnamese Ministry of Education and Training