Colon cancer subtypes from gene expression data

Nathan Cunningham, Giuseppe Di Benedetto, Sherman Ip, Leon Law

Module 6: Applied Statistics

26th February 2016

Aim

I Replicate findings of Felipe De Sousa et al. (2013)

I Cluster analysis to identify subtypes of colon cancer

I Construct a classifier to identify clusters

I Identify a suitable subset of the data to perform these analyses

I Consider robustness of findings to changes in methods and perturbations in the data

Data

I GSE33113 data set (Academic Medical Centre, Amsterdam)

I Patients with stage II colon cancer

I 90 patients, 54,675 gene expressions recorded

Data processing

I Normalisation to remove batch effects

I Gene expression presence detected using the barcode algorithm, and genes not present in at least one sample removed

I Genes with a median absolute deviation > 0.5 retained and median centred

I Felipe De Sousa et al. (2013) find 7,846 genes remain; we find anywhere from none to all of the genes remain

I Use 146 genes identified by Felipe De Sousa et al. (2013) in analyses
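The median absolute deviation (MAD) filter and median centring described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the code used for the analysis; the dictionary layout, the function names and the toy values are assumptions made for the example.

```python
from statistics import median

def mad(values):
    """Median absolute deviation of one gene's expression values."""
    centre = median(values)
    return median(abs(v - centre) for v in values)

def filter_and_centre(expression, cutoff=0.5):
    """Keep genes with MAD > cutoff and subtract each kept gene's median.

    `expression` maps a gene name to its list of per-sample values
    (a hypothetical layout chosen for this illustration).
    """
    kept = {}
    for gene, values in expression.items():
        if mad(values) > cutoff:
            centre = median(values)
            kept[gene] = [v - centre for v in values]
    return kept

# Toy data: three made-up genes measured in five samples.
toy = {
    "gene_a": [1.0, 1.1, 0.9, 1.0, 1.05],  # nearly constant, fails the filter
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # variable, passes the filter
    "gene_c": [5.0, 5.0, 5.0, 5.0, 9.0],   # single outlier, MAD is 0
}
filtered = filter_and_centre(toy)
```

Because the cutoff is absolute, the number of genes retained depends on the scale of the normalised data, which is one plausible reason for the count varying between none and all of the genes.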

Cluster Analysis

I Hierarchical – agglomerative, average linkage

I K-Means

I Consensus

I Model-based (Fraley & Raftery, 2002)

How many clusters?

Clustering methods comparison

I Homogeneity: reflects compactness of the clusters

I Separation: reflects the distance between clusters

I Silhouette: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$

I WADP (weighted average discrepant pairs)
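As an illustration of the silhouette formula, the sketch below takes a(i) as the mean distance from sample i to the other members of its own cluster and b(i) as the mean distance to the members of the nearest other cluster, which is the usual definition and is assumed here since the slides do not spell it out. Euclidean distance, the function name and the toy points are also assumptions.

```python
import math

def silhouette_values(points, labels):
    """Return s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    # Group sample indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is set to 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist(i, own)  # cohesion within the sample's own cluster
        b = min(mean_dist(i, members)  # separation from the nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# Two well-separated toy clusters of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
```

Values close to 1 indicate a sample that sits well inside its cluster; values near 0 or below indicate a sample on or across a cluster boundary.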

Robustness under perturbation

[Figure: value (0.00 to 0.20) plotted against sd (0.0 to 2.0) for cons_kmeans, mclust and cons_hierclust]

Cluster methods comparison

Cluster comparison              WADP value
C-k-means vs C-hierarchical     0.070
MClust vs C-hierarchical        0.015
C-k-means vs MClust             0.081
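The slides do not define how WADP was computed, so the following is only a sketch under an assumed definition: for each cluster of a reference labelling, take the proportion of within-cluster sample pairs that a second labelling places in different clusters, then average these proportions weighted by cluster size. The function name and the toy labellings are hypothetical.

```python
from itertools import combinations

def wadp(reference, other):
    """Size-weighted average, over reference clusters, of the share of
    within-cluster pairs that `other` splits apart (assumed definition)."""
    clusters = {}
    for idx, lab in enumerate(reference):
        clusters.setdefault(lab, []).append(idx)

    weighted_sum = 0.0
    total_weight = 0
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:  # singleton clusters contain no pairs
            continue
        discrepant = sum(1 for i, j in pairs if other[i] != other[j])
        weighted_sum += len(members) * discrepant / len(pairs)
        total_weight += len(members)
    return weighted_sum / total_weight if total_weight else 0.0

# Toy labellings of six samples; the second moves one sample.
labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [0, 0, 1, 1, 1, 1]
score = wadp(labels_a, labels_b)
```

Under this definition a value of 0 means the second labelling never separates samples that the reference groups together, so smaller values indicate closer agreement. The measure is not symmetric in its two arguments.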

Classification: PAM

I R package for implementing nearest shrunken centroid classification.

I Gives higher weights to genes in a class that are far away from the overall centroid of the genes.

I A threshold parameter specifies a shrinkage for the weights, giving higher weights to genes which are stable within the class.

I Can eliminate genes with ‘weaker’ effects, allowing automatic feature selection.

I Classification by considering the smallest distance to the shrunken centroid.
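The shrinkage and classification steps listed above can be sketched in plain Python. This is a simplified illustration of the nearest shrunken centroid idea, not the R package's implementation; the scaling term m_k, the offset s0 (taken as the median of the per-gene standard deviations), all function names and the toy data are assumptions made for the example.

```python
import math
from statistics import mean, median

def fit_shrunken_centroids(X, y, threshold):
    """X: list of samples (lists of gene values); y: class labels."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    rows = {k: [x for x, lab in zip(X, y) if lab == k] for k in classes}
    overall = [mean(x[g] for x in X) for g in range(p)]
    centroid = {k: [mean(x[g] for x in rows[k]) for g in range(p)] for k in classes}

    # Pooled within-class standard deviation per gene, plus offset s0.
    s = []
    for g in range(p):
        ss = sum((x[g] - centroid[lab][g]) ** 2 for x, lab in zip(X, y))
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)

    shrunken = {}
    for k in classes:
        m_k = math.sqrt(1 / len(rows[k]) - 1 / n)
        shrunken[k] = []
        for g in range(p):
            scale = m_k * (s[g] + s0)
            d = (centroid[k][g] - overall[g]) / scale  # standardised difference
            d_shrunk = math.copysign(max(abs(d) - threshold, 0.0), d)  # soft threshold
            shrunken[k].append(overall[g] + scale * d_shrunk)
    priors = {k: len(rows[k]) / n for k in classes}
    return {"shrunken": shrunken, "overall": overall, "s": s, "s0": s0, "priors": priors}

def predict(model, x):
    """Assign x to the class with the smallest standardised distance."""
    def score(k):
        dist = sum(((xg - c) / (sg + model["s0"])) ** 2
                   for xg, c, sg in zip(x, model["shrunken"][k], model["s"]))
        return dist - 2 * math.log(model["priors"][k])
    return min(model["shrunken"], key=score)

def surviving_genes(model):
    """Indices of genes whose centroid differs from the overall centroid."""
    return [g for g, mu in enumerate(model["overall"])
            if any(abs(c[g] - mu) > 1e-12 for c in model["shrunken"].values())]

# Toy data: gene 0 separates the classes, genes 1 and 2 do not.
X = [[5.0, 1.0, 0.2], [5.5, 1.2, -0.1], [4.5, 0.8, 0.1],
     [-5.0, 0.9, 0.0], [-5.5, 1.1, -0.2], [-4.5, 1.0, 0.3]]
y = ["A", "A", "A", "B", "B", "B"]
model = fit_shrunken_centroids(X, y, threshold=2.0)
```

With this toy data only gene 0 keeps a non-zero shrunken difference, which is the automatic feature selection the bullets describe: a gene whose class centroids are all shrunk to the overall centroid no longer affects the distance comparison.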

Classification: Multi-Class SVM

I The R package e1071 was used to perform the multi-class SVM with an RBF kernel.

I Uses a one vs one approach (i.e. 3 binary classifiers) with class prediction done by a voting scheme.

I If a linear kernel was used instead, could perform feature selection based on ranking of the features using their weights.
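The one vs one voting scheme can be illustrated with a short sketch. The three binary classifiers below are hypothetical stand-ins (simple threshold rules on a one-dimensional input) rather than trained SVMs, and the class names are made up; only the voting logic is the point.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_classifiers):
    """binary_classifiers maps a class pair (a, b) to a function that
    returns a or b for input x. Ties are broken by the order of `classes`."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[binary_classifiers[pair](x)] += 1
    return max(classes, key=lambda c: votes[c])

# Hypothetical stand-ins for three trained binary classifiers.
classes = ["subtype1", "subtype2", "subtype3"]
binary_classifiers = {
    ("subtype1", "subtype2"): lambda x: "subtype1" if x < 1.0 else "subtype2",
    ("subtype1", "subtype3"): lambda x: "subtype1" if x < 2.0 else "subtype3",
    ("subtype2", "subtype3"): lambda x: "subtype2" if x < 3.0 else "subtype3",
}
predictions = [one_vs_one_predict(x, classes, binary_classifiers)
               for x in (0.5, 1.5, 3.5)]
```

With K classes the scheme needs K(K − 1)/2 binary classifiers, which gives the 3 classifiers mentioned above for K = 3.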

Classification: Random Forest

I The R package randomForest was used to train a random forest.

I A total of 300 trees were built, with 12 variables randomly chosen as candidates at each split.

I Feature selection can be done using mean decrease in accuracy, which permutes each feature and measures the change in out-of-bag error.
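The permutation idea behind mean decrease in accuracy can be sketched as follows. This simplified version permutes one feature on a fixed evaluation set for a single fixed classifier, whereas a random forest does this per tree on that tree's out-of-bag samples; the classifier, the data and the function names are hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == lab for x, lab in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [x[feature] for x in X]
        rng.shuffle(column)
        # Rebuild each sample with the shuffled value in place of the original.
        X_perm = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(X, column)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats

# Hypothetical classifier that only looks at feature 0.
predict = lambda x: "A" if x[0] > 0 else "B"
X = [[1.0, 0.3], [2.0, -0.5], [1.5, 0.1], [-1.0, 0.2], [-2.0, -0.4], [-1.5, 0.0]]
y = ["A", "A", "A", "B", "B", "B"]
importance_0 = permutation_importance(predict, X, y, feature=0)
importance_1 = permutation_importance(predict, X, y, feature=1)
```

A feature the classifier never uses has an importance of exactly zero, so ranking features by this quantity gives a basis for feature selection.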

Results: PAM

[Plot: misclassification error (0.0 to 0.8) against value of threshold (0 to 10); the top axis gives the number of genes: 146, 146, 145, 127, 110, 89, 70, 57, 46, 35, 28, 19, 13, 7, 7, 6, 4. A second panel shows the error separately for Label 1, Label 2 and Label 3.]

Figure: 10-fold cross validation error. Optimal threshold was estimated to be 6.2 ± 0.2.

Results: SVM and Random Forest and PAM

Method                       10-Fold Cross Validation Error
SVM (C = 1, γ = 0.0068)      1.1%
PAM (threshold = 6.2)        2.2%
Random Forest                3.3%

Table: 10-fold cross validation average error on the trained classifiers

Error bars can be estimated using bootstrapping.
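As a quick check on the scale of these error rates, the sketch below splits 90 samples into 10 folds and converts misclassification counts into percentages. It assumes that all 90 patients took part in the cross validation, which the slides do not state explicitly; the function names are hypothetical, and in practice the samples would be shuffled before splitting.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_error(n_misclassified, n_samples):
    """Cross validation error as a percentage of all samples."""
    return 100.0 * n_misclassified / n_samples

folds = k_fold_indices(90, k=10)  # ten folds of nine samples each
rates = [round(cv_error(m, 90), 1) for m in (1, 2, 3)]
```

Under that assumption, 1, 2 and 3 misclassified samples out of 90 give 1.1%, 2.2% and 3.3%, so the three classifiers differ by only one or two samples.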

Results: PAM Bootstrapping

[Plot: validation error (%), 0 to 30, against threshold (unknown units), 0 to 10]

Figure: Median (point) and 95% percentile (error bar) of the 10-fold cross validation error, bootstrapping 500 times.


[Plot: number of genes which survived thresholding (genes), 0 to 140, against threshold (unknown units), 0 to 10]

Figure: Mean (point) and standard deviation (error bar) of the number of genes which survived thresholding, bootstrapping 500 times.



Method                    10-fold cross validation error (%)
PAM (threshold = 0.0)     $1.1^{+2.2}_{-1.1}$
PAM (threshold = 6.2)     $2.2^{+3.3}_{-2.2}$
SVM                       $1.1^{+1.1}_{-0.0}$
Random Forest             $1.1^{+2.2}_{-1.1}$

Table: Median and 95% percentile of the 10-fold cross validation error, bootstrapping 500 times.

For PAM with threshold 6.2, (36.5 ± 6.3) genes survived thresholding.
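The slides do not say exactly what was resampled or how the percentile error bars were formed, so the following is only a sketch of one common approach: resample with replacement 500 times, recompute the statistic each time, and report the median together with the 2.5th and 97.5th percentiles. For brevity it resamples per-sample outcomes directly instead of rerunning the cross validation on each resample; the nearest-rank percentile rule, the function names and the toy outcomes are all assumptions.

```python
import math
import random
from statistics import median

def percentile(sorted_values, q):
    """Nearest-rank percentile (q in [0, 100]) of an already sorted list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def bootstrap_interval(statistic, data, n_boot=500, level=95, seed=0):
    """Median and central percentile interval of a bootstrapped statistic."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    tail = (100 - level) / 2
    return median(stats), percentile(stats, tail), percentile(stats, 100 - tail)

# Toy data: 90 made-up outcomes, 1 = misclassified, 0 = correct.
outcomes = [1, 1] + [0] * 88
error_rate = lambda sample: 100.0 * sum(sample) / len(sample)
mid, low, high = bootstrap_interval(error_rate, outcomes)
```

Asymmetric error bars of the form $x^{+u}_{-l}$ then correspond to u = high − mid and l = mid − low.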

Conclusion

I Clustering methods were robust under perturbation of the data

I PAM performed similarly to the other classifiers (SVM and random forest)

I More thresholds to be investigated

I Scale to larger datasets

References

Felipe De Sousa, E. M., Wang, X., Jansen, M., Fessler, E., Trinh, A., de Rooij, L. P., ... others (2013). Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nature Medicine, 19(5), 614–618.

Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458), 611–631.
