Colon cancer subtypes from gene expression data
Nathan Cunningham, Giuseppe Di Benedetto, Sherman Ip, Leon Law
Module 6: Applied Statistics
26th February 2016
Aim
I Replicate findings of Felipe De Sousa et al. (2013)
I Cluster analysis to identify subtypes of colon cancer
I Construct a classifier to identify clusters
I Identify a suitable subset of the data to perform these analyses
I Consider robustness of findings to changes in methods and perturbations in the data
Data
I GSE33113 data set (Academic Medical Centre, Amsterdam)
I Patients with stage II colon cancer
I 90 patients, 54,675 gene expressions recorded
Data processing
I Normalisation to remove batch effects
I Gene expression presence detected using the barcode algorithm, and genes not present in at least one sample removed
I Genes with a median absolute deviation > 0.5 retained and median centred
I Felipe De Sousa et al. (2013) find 7,846 genes remain; we find anywhere from none to all of the genes remain
I Use the 146 genes identified by Felipe De Sousa et al. (2013) in analyses
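The filtering and centring step above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the slides: the function names and the toy data are hypothetical, and the median absolute deviation is computed without a scaling constant, which is an assumption (some implementations multiply by a consistency factor, which changes which genes pass a fixed cut-off).

```python
from statistics import median

def mad(values):
    """Unscaled median absolute deviation of a sequence of numbers."""
    m = median(values)
    return median(abs(v - m) for v in values)

def filter_and_centre(expression, threshold=0.5):
    """Keep genes whose MAD exceeds `threshold`, then median-centre them.

    `expression` maps a gene name to its list of per-sample values.
    """
    kept = {}
    for gene, values in expression.items():
        if mad(values) > threshold:
            m = median(values)
            kept[gene] = [v - m for v in values]
    return kept

# Toy data, made up for illustration (not taken from GSE33113).
toy = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # MAD = 1.0, retained
    "gene_b": [2.0, 2.1, 2.0, 1.9, 2.0],  # MAD = 0.0, dropped
}
result = filter_and_centre(toy)
```

Only `gene_a` passes the filter here, and its values are shifted so that their median is zero.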
Cluster Analysis
I Hierarchical – agglomerative, average linkage
I K-Means
I Consensus
I Model-based (Fraley & Raftery, 2002)
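As an illustration of the first method in the list, agglomerative clustering with average linkage can be written directly from its definition: start with one cluster per sample and repeatedly merge the two clusters with the smallest mean pairwise distance. The sketch below assumes Euclidean distance and uses hypothetical function names and toy points; the slides do not state the distance measure used.

```python
from itertools import combinations
from math import dist

def mean_pairwise_distance(points, cluster_a, cluster_b):
    """Average linkage: mean distance over all cross-cluster pairs."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def average_linkage(points, n_clusters):
    """Agglomerate singleton clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: mean_pairwise_distance(
                points, clusters[pair[0]], clusters[pair[1]]
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a (a < b)
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy points, made up for illustration.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
toy_clusters = average_linkage(toy_points, n_clusters=3)
```

This brute-force version recomputes every pairwise average at each merge, which is fine for a sketch but slow for large data.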
How many clusters?
Clustering methods comparison
I Homogeneity: reflects compactness of the clusters
I Separation: reflects the distance between clusters
I Silhouette: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
I WADP (weighted average discrepant pairs)
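The silhouette of a single sample can be computed directly from its definition, where a(i) is the mean distance from sample i to the other members of its own cluster and b(i) is the smallest mean distance from sample i to the members of any other cluster. The sketch below assumes Euclidean distance and at least two clusters; the function name and the toy data are hypothetical.

```python
from math import dist

def silhouette(points, labels, i):
    """Silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)) for sample i."""
    def mean_dist_to(cluster):
        members = [j for j, lab in enumerate(labels) if lab == cluster and j != i]
        return sum(dist(points[i], points[j]) for j in members) / len(members)

    own = labels[i]
    if labels.count(own) == 1:
        return 0.0  # common convention for a singleton cluster
    a = mean_dist_to(own)
    b = min(mean_dist_to(c) for c in set(labels) if c != own)
    return (b - a) / max(a, b)

# Toy one-dimensional data, made up for illustration.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
s0 = silhouette(toy_points, toy_labels, 0)  # a = 1, b = 10.5
```

Values close to 1 indicate a sample that sits well inside its own cluster; values near 0 or below indicate a sample on or across a cluster boundary.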
Robustness under perturbation
[Figure: value (0.00 to 0.20) plotted against sd (0.0 to 2.0) for cons_kmeans, mclust and cons_hierclust.]
Cluster methods comparison
Cluster comparison            WADP value
C-k-means vs C-hierarchical   0.070
MClust vs C-hierarchical      0.015
C-k-means vs MClust           0.081
Classification: PAM
I R package implementing nearest shrunken centroid classification.
I Gives higher weights to genes in a class that are far away from the overall centroid of the genes.
I A threshold parameter specifies a shrinkage for the weights, giving higher weights to genes which are stable within the class.
I Can eliminate the 'weaker' effect of genes, allowing automatic feature selection.
I Classifies a sample by the smallest distance to the shrunken centroids.
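The link between the threshold and feature selection can be made concrete with soft thresholding. In the usual formulation of nearest shrunken centroids, each gene has one standardised difference per class between the class centroid and the overall centroid; every difference is moved towards zero by the threshold, and a gene whose differences all reach zero no longer influences the classification. The sketch below shows only this shrinkage step, with hypothetical function names and made-up scores; it is not the package's implementation.

```python
def soft_threshold(d, delta):
    """Shrink d towards zero by delta; values within delta of zero become 0."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0

def surviving_genes(scores, delta):
    """Return genes that keep a non-zero shrunken score in at least one class.

    `scores` maps a gene name to its per-class standardised differences.
    """
    return [
        gene
        for gene, ds in scores.items()
        if any(soft_threshold(d, delta) != 0.0 for d in ds)
    ]

# Made-up standardised differences for three classes (illustration only).
toy_scores = {
    "gene_a": [7.5, -3.0, -4.5],
    "gene_b": [1.0, -0.5, -0.5],
    "gene_c": [-6.5, 2.0, 4.5],
}
kept = surviving_genes(toy_scores, delta=6.2)
```

Raising the threshold removes more genes, which is why the number of genes falls as the threshold grows.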
Classification: Multi-Class SVM
I The R package e1071 was used to perform the multi-class SVM with an RBF kernel.
I Uses a one-vs-one approach (i.e. 3 binary classifiers), with class prediction done by a voting scheme.
I If a linear kernel were used instead, feature selection could be performed by ranking the features using their weights.
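The one-vs-one scheme can be sketched independently of the SVM itself: with three classes there are 3 × 2 / 2 = 3 pairs, one binary classifier is trained per pair, and the predicted class is the one with the most votes. In the sketch below the binary classifiers are hypothetical threshold rules standing in for trained SVMs, and ties are broken arbitrarily by insertion order.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, binary_classifiers, x):
    """Predict by majority vote over all pairwise binary classifiers."""
    votes = Counter(
        binary_classifiers[pair](x) for pair in combinations(classes, 2)
    )
    return votes.most_common(1)[0][0]

classes = ["A", "B", "C"]
pairs = list(combinations(classes, 2))  # 3 pairs for 3 classes

# Hypothetical stand-ins for trained binary SVMs on a single feature.
binary_classifiers = {
    ("A", "B"): lambda x: "A" if x < 1.0 else "B",
    ("A", "C"): lambda x: "A" if x < 2.0 else "C",
    ("B", "C"): lambda x: "B" if x < 3.0 else "C",
}

predictions = [one_vs_one_predict(classes, binary_classifiers, x) for x in (0.5, 2.5, 5.0)]
```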
Classification: Random Forest
I The R package randomForest was used to train a random forest.
I A total of 300 trees were built, with 12 variables randomly chosen as candidates at each split.
I Feature selection can be done using mean decrease in accuracy, which uses permutation of the features and the out-of-bag error.
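The idea behind mean decrease in accuracy can be sketched as follows: permute the values of one feature across samples, re-evaluate the classifier, and record how much the accuracy drops. This simplified version is an assumption-laden stand-in: it permutes over a single evaluation set with an arbitrary `predict` function, whereas a random forest does this per tree on that tree's out-of-bag samples. All names and data are hypothetical.

```python
import random

def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy when column `feature` of X is shuffled."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        permuted = [
            row[:feature] + [value] + row[feature + 1:]
            for row, value in zip(X, column)
        ]
        drops.append(baseline - accuracy(permuted))
    return sum(drops) / n_repeats

# Toy data: the label equals feature 0; feature 1 is irrelevant.
X = [[0, 5], [0, 3], [0, 8], [0, 1], [1, 4], [1, 7], [1, 2], [1, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
predict = lambda row: row[0]  # hypothetical classifier using feature 0 only

importance_0 = permutation_importance(predict, X, y, feature=0)
importance_1 = permutation_importance(predict, X, y, feature=1)
```

Shuffling the irrelevant feature leaves the accuracy unchanged, so its importance is zero, while shuffling the informative feature lowers the accuracy.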
Results: PAM
[Figure: misclassification error (0.0 to 0.8) against value of threshold (0 to 10), with the number of genes (146 down to 4) along the top axis; a second panel on the same axes has a legend for Label 1, Label 2 and Label 3.]
Figure: 10-fold cross validation error. Optimal threshold was estimated to be 6.2 ± 0.2.
Results: SVM and Random Forest and PAM
Method                     10-Fold Cross Validation Error
SVM (C = 1, γ = 0.0068)    1.1%
PAM (threshold = 6.2)      2.2%
Random Forest              3.3%
Table: 10-fold cross validation average error on the trained classifiers
Error bars can be estimated using bootstrapping.
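A percentile bootstrap for an error rate can be sketched as below. This is a deliberate simplification and an assumption about the procedure: it resamples per-patient error indicators with replacement, whereas repeating the whole cross validation on each resample would also capture variability from refitting the classifier. The function name, the seed and the choice of the 2.5th and 97.5th percentiles are hypothetical.

```python
import random
from statistics import median

def bootstrap_error(errors, n_boot=500, seed=0):
    """Return (lower, median, upper) of the bootstrapped error rate.

    `errors` holds one 0/1 indicator per patient (1 = misclassified).
    """
    rng = random.Random(seed)
    n = len(errors)
    # Resample patients with replacement and recompute the error rate.
    rates = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_boot))
    lower = rates[int(0.025 * (n_boot - 1))]
    upper = rates[int(0.975 * (n_boot - 1))]
    return lower, median(rates), upper

# 2 misclassified patients out of 90 corresponds to an error of about 2.2%.
errors = [1] * 2 + [0] * 88
lower, mid, upper = bootstrap_error(errors)
```

With only 90 patients the error rate moves in steps of 1/90, about 1.1 percentage points, so the resulting intervals are coarse.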
Results: PAM Bootstrapping
[Figure: validation error (%), 0 to 30, against threshold (unknown units), 0 to 10.]
Figure: Median (point) and 95% percentile (error bar) of the 10-fold cross validation error, bootstrapping 500 times.
Results: PAM Bootstrapping
[Figure: number of genes which survived thresholding (genes), 0 to 140, against threshold (unknown units), 0 to 10.]
Figure: Mean (point) and standard deviation (error bar) of the number of genes which survived thresholding, bootstrapping 500 times.
Results: PAM Bootstrapping
Method                   10-Fold Cross Validation Error
PAM (threshold = 0.0)    $(1.1^{+2.2}_{-1.1})\%$
PAM (threshold = 6.2)    $(2.2^{+3.3}_{-2.2})\%$
SVM                      $(1.1^{+1.1}_{-0.0})\%$
Random Forest            $(1.1^{+2.2}_{-1.1})\%$
Table: Median and 95% percentile of the 10-fold cross validation error, bootstrapping 500 times.
For PAM with threshold 6.2, (36.5 ± 6.3) genes survived thresholding.
Conclusion
I Clustering methods were robust
I PAM performed similarly to the other methods
I More thresholds to be investigated
I Scale to larger datasets
References
Felipe De Sousa, E. M., Wang, X., Jansen, M., Fessler, E., Trinh, A., de Rooij, L. P., ... others (2013). Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nature Medicine, 19(5), 614–618.
Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611–631.