Goals and Data Availability
Feature Selection
Unsupervised Clustering of Patient Gene Profiles
Supervised Classification and Regression
Biological Significance
Feature Selection Highlights Gene Ontology Networks and KEGG Pathways Implicated in Cancer
Role of alpha-6beta-4 integrins in carcinoma progression
Outcome | Pathway | FDR | Genes in Feature Set
Estrogen Receptor | TGF-Beta-dependent induction of EMT via MAPK | 5.75e-2 | TGF-Beta, TGF-Beta 3, PTEN, Claudin-1
Progesterone Receptor | Slit-Robo signaling | 7.96e-2 | SLIT1, SLIT3
Progesterone Receptor | Cadherin-mediated cell adhesion | 7.96e-2 | M-cadherin, P-cadherin
HER2 Receptor | ERBB-family signaling | 5.01e-2 | ERBB2, GRB7
HER2 Receptor | Role of alpha-6/beta-4 integrins in carcinoma progression | 5.01e-2 | ERBB2, ITGB4, 14-3-3 tau
Survivability | Cell adhesion and ECM remodeling | 5.51e-2 | Kallikrein 3, Kallikrein 2, Kallikrein 1
Survivability | Bradykinin/kallidin maturation | 4.88e-2 | CPB2, Tissue Kallikrein, Plasma Kallikrein
Survivability | Role of Parkin in the ubiquitin-proteasomal pathway | 8.75e-2 | UBE1, Cyclin E, HSP70, Tubulin Beta
Cell adhesion and ECM remodeling
ER gene set overlaps with the ESR1 (nuclear), MALL, Hyaluronic acid, NCOA3, Estradiol network (5461 genes present in feature set)
GO processes:
Reproductive structure development (20572e-7)
Female sex differentiation (109, p=8.24e-6)
Growth hormone receptor signaling pathway (52, p=3.43e-18)
HER2 gene set overlaps with the Survivin, Bcl-6, AKT (PKB), Fra-1, Bim network (114180 genes present in feature set)
GO processes:
Regulation of cell death (43, p=2.4e-29)
Regulation of apoptotic process (414, p=9.82e-29)
Regulation of programmed cell death (414, p=1.65e-28)
Apoptotic signaling pathway (225, p=4.78e-27)
ANOVA p-value for Survivability = 1.80e-4
Tukey HSD Difference of Means (Days)
Expected +346 -346
Lower +304 -304
Upper +386 -386
Hierarchical Clustering
Breast cancer is the most commonly diagnosed cancer and the second leading cause of cancer death among women. This highly heterogeneous disease has many clonal subtypes, and patients respond differently to treatment because of genetic differences in both the disease and the patient. To better understand this complexity, The Cancer Genome Atlas (TCGA), a rich database of genetic and electronic medical information, can be mined to identify distinctive features for each clonal subtype and for patient survivability. Using the significant genes found, unsupervised and supervised methods can be applied to find genetic differences within clonal subgroups and to predict clinical outcomes from these features.
Chi-Squared p-value for Estrogen Receptor = 4.95e-43
Outcome Chi-Squared Standardized Residuals
0 -10 24
1 10 -25
-78 163 -15 -42 -22 -36 -28
79 -161 13 42 23 36 28
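The residual table above can be read with a small worked example. The poster does not state which residual definition was used, so this sketch assumes the adjusted standardized residual (the quantity R's `chisq.test` reports as `stdres`); the counts are hypothetical and are not taken from the TCGA data.

```python
import math

def standardized_residuals(table):
    """Adjusted standardized residuals for an r x c table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            # Variance term corrects for the row and column margins.
            scale = math.sqrt(
                expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            )
            out_row.append((observed - expected) / scale)
        result.append(out_row)
    return result

# Hypothetical counts: rows = receptor status (0/1), columns = two clusters.
counts = [[30, 10], [20, 40]]
res = standardized_residuals(counts)
for status, row in enumerate(res):
    print(status, [round(v, 2) for v in row])
```

In a table with two outcome rows, the residuals in each column are equal in magnitude and opposite in sign, which is the mirrored pattern visible in the rows above.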
Genes with the strongest effect size
K-Means Clustering
Forward subset selection
1. Linear/logistic regression of each feature against numerical/categorical outcomes (respectively) was performed.
2. Features with FDR > 0.05 for association with outcome were excluded.
3. The regsubsets function (leaps library in R) was used to perform greedy forward subset selection on the remaining features.
4. Feature subset size was selected to minimize the BIC.
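The FDR filter in step 2 can be sketched as follows. The poster does not say which FDR procedure was applied; this example assumes Benjamini-Hochberg adjustment, and the gene names and p-values are hypothetical placeholders.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene p-values from the univariate regressions in step 1.
pvals = {"GENE_A": 0.001, "GENE_B": 0.02, "GENE_C": 0.04, "GENE_D": 0.30}
names = list(pvals)
qvals = benjamini_hochberg([pvals[g] for g in names])

# Keep only features whose adjusted p-value does not exceed 0.05.
kept = [g for g, q in zip(names, qvals) if q <= 0.05]
print(kept)
```

Here GENE_C has a raw p-value below 0.05 but is dropped after adjustment, which is the point of filtering on FDR instead of on raw p-values.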
Elastic net regularization and Group lasso
glmnet package in R
Input: RNA-Seq gene expression levels for 20,531 genes across 800 subjects
Pre-processing
1. Removed 475 genes with low expression (< 1 FPKM).
2. Removed 699 genes with < 0.1 FPKM^2 expression variance.
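The two pre-processing filters can be illustrated with a minimal sketch. It assumes that "low expression" means the mean FPKM across subjects and that the variance is the sample variance; the poster does not specify either, and the gene names and values are hypothetical.

```python
from statistics import mean, variance

def filter_genes(expr, min_mean=1.0, min_var=0.1):
    """Drop genes with low mean expression, then genes with low variance.

    expr maps gene name -> list of FPKM values, one per subject.
    """
    kept = {}
    for gene, values in expr.items():
        if mean(values) < min_mean:      # filter 1: low expression
            continue
        if variance(values) < min_var:   # filter 2: nearly constant
            continue
        kept[gene] = values
    return kept

# Hypothetical expression values for three genes across four subjects.
expr = {
    "GENE_LOW": [0.1, 0.2, 0.0, 0.3],
    "GENE_FLAT": [5.0, 5.1, 5.0, 5.1],
    "GENE_OK": [2.0, 8.0, 4.0, 10.0],
}
filtered = filter_genes(expr)
print(sorted(filtered))
```

Genes that barely vary across subjects carry little information for separating subtypes, so removing them first reduces the number of features passed to selection.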
Result
Survivability: 277 genes
Age of Diagnosis: 194 genes
Cancer stage: 140 genes
Estrogen Receptor: 176 genes
Progesterone Receptor: 175 genes
HER2 Receptor: 52 genes
Using 10-fold cross-validation and selecting feature subset size to minimize BIC
P-Value = 1.26e-31 P-Value = 0.89
P-Value = 1.38e-07 P-Value = 1.66e-26
Group lasso (alpha=1)
Elastic net regularization (alpha=0.5)
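For reference, the role of alpha can be read off the penalized objective that glmnet minimizes for a Gaussian response. The notation follows the glmnet documentation, not the poster: N subjects, response y_i, feature vector x_i, coefficients beta, and regularization strength lambda.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
+\alpha\,\lVert\beta\rVert_1\right]
```

With alpha = 0.5 the ridge and lasso terms are mixed equally. With alpha = 1 the ridge term vanishes and the penalty reduces to the lasso penalty.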
The number of patients in each clinical categorical feature. The colored bars represent data used in further analysis.
Forward Selection
Elastic Net Regularization
Group Lasso
391
149 15672
31
204
7
Regressor | RMSE | Median Absolute Error | R2
SVR Linear Kernel | 764.29 | 497.39 | 0.60
SVR Sigmoid Kernel | 1212.04 | 909.00 | 0.00
Linear Regression | 479.35 | 381.36 | 0.84
Patient Distribution in Breast Cancer TCGA Clinical Features
TRIM72 ZFP42
LACRT TMIGD1
ACSM1 CCNE1
SCUBE2 PP1R1C
IL19 CPB2
PCA and K-Means++ were used to create the clusters below. The distinct groups suggest additional patient subgroups that could inform more precise treatments.
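The clustering step can be sketched in plain Python. This is only an illustration of K-Means with K-Means++ seeding, not the poster's implementation: the PCA projection is omitted, and the 2-D points stand in for hypothetical patient coordinates after projection.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: sample new centers proportionally to squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm started from K-Means++ centers."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centers[j])  # keep an empty cluster's center
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers

# Hypothetical 2-D coordinates forming two well-separated groups.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(pts, k=2)
print(labels)
```

The seeding favors points far from the centers already chosen, which makes it unlikely that two initial centers land in the same dense group.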
Regressor | RMSE | Median Absolute Error | R2
SVR Linear Kernel | 2.31 | 1.36 | 0.10
SVR Sigmoid Kernel | 2.54 | 0.90 | -0.08
Linear Regression | 2.05 | 1.35 | 0.30
Tumor Stage ZNF705A LCE3E
PZP
PRDM14 ZPBP2
ZIC4 LOC93432 LCT GSDMC
B3GNT5 SOX11 SEL1L3 HAPLN3 MALL CLDN1 CDH3 MMP1 S100A1
PADI2
Survivability
Survival Analysis
Prediction of clonal subtypes in breast invasive carcinoma
Hunter Boyce, Alice Yu, Anna Shcherbina
CS229 Fall 2015
Range 1-10
Range 158-4456 days
p53 Bcl-6
Red bars indicate that the pathway gene is present in the feature set.
Classification
Receiver Operating Characteristic (ROC) curves for the prediction of two subtypes of cancer. The area under the curve is given with the classifier in the legend.
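As a reminder of what the area under the curve measures, the sketch below computes it directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and scores are hypothetical, and this pairwise form is just one equivalent way to compute the area, not necessarily the one used for the poster.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical receptor-status labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(y_true, y_score)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.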
Regression
Statistical metrics of regression performance on RNA expression for predicting tumor stage and survival time.
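The three reported metrics can be computed as in the sketch below, which uses their standard definitions. The true and predicted values are hypothetical survival times in days, not results from the poster.

```python
import math
from statistics import mean, median

def regression_metrics(y_true, y_pred):
    """Return (RMSE, median absolute error, R^2) for paired values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(mean(e * e for e in errors))
    med_ae = median(abs(e) for e in errors)
    y_bar = mean(y_true)
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return rmse, med_ae, r2

# Hypothetical survival times (days) and model predictions.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 380]
rmse, med_ae, r2 = regression_metrics(y_true, y_pred)
print(round(rmse, 2), med_ae, round(r2, 2))
```

RMSE is sensitive to a few large errors while the median absolute error is not, so the two can rank models differently. R^2 becomes negative when a model predicts worse than the mean of the observed values.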
Kaplan-Meier plots for the survival of sub-populations of patients. The results show the opposite of what is expected; this should be explored further.
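A minimal sketch of the Kaplan-Meier estimator behind such plots is shown below. The follow-up times and event flags are hypothetical (chosen within the 158-4456 day range quoted on the poster), with 1 marking an observed death and 0 a censored patient.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time."""
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without changing the estimate.
        at_risk -= removed
    return curve

# Hypothetical follow-up times in days; 1 = death observed, 0 = censored.
days = [158, 400, 400, 900, 1500, 4456]
observed = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(days, observed)
for t, s in curve:
    print(t, round(s, 3))
```

Comparing two sub-populations amounts to running the estimator once per group and plotting the resulting step functions together.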