Post on 03-Jan-2016
Biology-Driven Clustering of Microarray Data
Applications to the NCI60 Data Set
K.R. Coombes, K.A. Baggerly, D.N. Stivers,
J. Wang, D. Gold, H.G. Sung, and S.J. Lee
Introduction
• Microarray data is more than a large, unstructured matrix.
– We already know many genes important for studying cancer through their involvement in specific biological processes
– We also know that reproducible chromosomal abnormalities play an important role in cancer
• Need analytical methods that use biological information early
Methods
• First, updated the annotations of the genes on the microarray
• Performed separate analyses
– using genes on individual chromosomes
– using genes involved in different biological processes
• Developed ways to assess how well each set of genes classified samples
Quality of Annotations
• Problem:
– I.M.A.G.E. clone IDs and GenBank accession numbers are archival
– UniGene clusters, gene names, descriptions, functions, etc., are changeable
• Solution:
– Download latest UniGene (build 137) and LocusLink to update annotations
How many genes on the array have good annotations?
Number of Spots   Current UniGene Status
294               None (control spots)
128               Only 3’ – unknown to UniGene
1379              Only 3’ – known to UniGene
1                 Only 5’ – unknown
6                 Only 5’ – known
399               Both – unknown
763               Both – 3’ known, 5’ unknown
291               Both – 3’ unknown, 5’ known
646               Both known, but disagree
6093              Both known, and agree
Only trust the 7478 spots where the UniGene clusters match.
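The figure of 7478 can be reproduced from the table. A minimal sketch, assuming a spot is "trusted" when every sequenced end maps to a known UniGene cluster and the ends do not disagree (the row labels below are shortened paraphrases, not the authors' wording):

```python
# Spot counts by UniGene status, copied from the table above.
spot_counts = {
    "none (control spots)": 294,
    "only 3', unknown": 128,
    "only 3', known": 1379,
    "only 5', unknown": 1,
    "only 5', known": 6,
    "both, both unknown": 399,
    "both, 3' known, 5' unknown": 763,
    "both, 3' unknown, 5' known": 291,
    "both known, disagree": 646,
    "both known, agree": 6093,
}

# Assumption: trusted rows are those with no unknown or conflicting end.
trusted_rows = ["only 3', known", "only 5', known", "both known, agree"]

total_spots = sum(spot_counts.values())
trusted_spots = sum(spot_counts[row] for row in trusted_rows)

print(total_spots)    # 10000
print(trusted_spots)  # 7478
```

Under this reading, roughly three quarters of the 10000 spots are kept.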
Where are the genes located?
[Figure: (Observed - Expected) / SD by chromosome (1 to 22, X, Y); y-axis from -6 to 6]
chi^2 = 148.8, p < 10^(-10)
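A comparison of this kind can be sketched as follows. The slide does not say how the expected counts or the SD were obtained, so this example assumes expected counts proportional to each chromosome's share of known genes, a binomial SD, and a Pearson chi-square statistic. The counts and shares are hypothetical toy values, not the NCI60 data.

```python
from math import sqrt

# Hypothetical inputs: spots observed per chromosome, and the share of
# all known genes that each chromosome carries (shares sum to 1).
observed = {"1": 30, "2": 20, "X": 10}
genome_share = {"1": 0.5, "2": 0.3, "X": 0.2}

def chromosome_deviations(observed, genome_share):
    """Return per-chromosome (observed - expected) / SD and the chi-square statistic."""
    n = sum(observed.values())
    z_scores = {}
    chi_square = 0.0
    for chrom, count in observed.items():
        p = genome_share[chrom]
        expected = n * p
        sd = sqrt(n * p * (1 - p))  # assumed binomial standard deviation
        z_scores[chrom] = (count - expected) / sd
        chi_square += (count - expected) ** 2 / expected
    return z_scores, chi_square

z_scores, chi_square = chromosome_deviations(observed, genome_share)
```

A large chi-square relative to its degrees of freedom indicates that the genes on the array are not spread across chromosomes in proportion to the reference shares.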
How do we determine the functions of genes?
• UniGene -> LocusLink -> GeneOntology
• GeneOntology is a structured, hierarchical vocabulary to describe gene functions in three broad areas:
– biological process (why)
– molecular function (what)
– cellular component (where)
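The UniGene -> LocusLink -> GeneOntology chain amounts to a series of table lookups. A minimal sketch, assuming the three mappings have already been loaded into dictionaries; every identifier below is a made-up placeholder rather than a real database entry:

```python
from collections import defaultdict

# Hypothetical lookup tables; real ones would be built from the
# UniGene, LocusLink, and GeneOntology downloads.
spot_to_unigene = {"spot_1": "cluster_A", "spot_2": "cluster_B", "spot_3": "cluster_A"}
unigene_to_locus = {"cluster_A": "locus_1", "cluster_B": "locus_2"}
locus_to_process = {
    "locus_1": ["cell cycle", "cell proliferation"],
    "locus_2": ["apoptosis"],
}

def spots_by_process(spot_to_unigene, unigene_to_locus, locus_to_process):
    """Group spots by biological process, skipping spots where any link is missing."""
    groups = defaultdict(list)
    for spot, cluster in spot_to_unigene.items():
        locus = unigene_to_locus.get(cluster)
        for process in locus_to_process.get(locus, []):
            groups[process].append(spot)
    return dict(groups)

gene_sets = spots_by_process(spot_to_unigene, unigene_to_locus, locus_to_process)
print(gene_sets["cell cycle"])  # ['spot_1', 'spot_3']
```

Each resulting list is one gene set that can be clustered separately.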
What kinds of genes are on the microarray?
Function               Ann.  Spots    Function             Ann.  Spots
Oncogenesis            140   180      Cell shape and size  78    101
Apoptosis              128   138      Protein traffic      157   188
Physiological proc.    180   210      Transport            146   136
Perc. of ext. stimuli  238   150      Cell proliferation   197   249
Ectoderm devel.        129   152      Stress response      599   372
Mesoderm devel.        92    102      Radiation response   147   136
Cell adhesion          111   140      Cell cycle           494   283
Cell-cell signaling    137   166      Nucleic acid met.    695   595
Signal transduction    222   228      Protein metabolism   471   567
Intracell sig cascade  110   110      Lipid metabolism     146   156
Cell motility          120   153      Carbohydrate met.    103   97
Cell organization      98    118      Energy pathways      88    98
Data Preprocessing
• Remove spots with poor annotations and spots with median intensity below the 97th percentile of empty spots.
• Normalize each array so the median ratio between channels is one (that is, the median log ratio is zero)
• Center each gene so mean log ratio across experiments is zero
• Use (1-correlation)/2 as distance metric
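The normalization, centering, and distance steps can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the data are taken to be a list of rows (genes) by columns (arrays) of log ratios, normalization is done by subtracting each array's median, and "correlation" is read as the Pearson correlation between two arrays. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, median

def normalize_arrays(log_ratios):
    """Shift each array (column) so that its median log ratio is zero."""
    n_arrays = len(log_ratios[0])
    medians = [median(row[j] for row in log_ratios) for j in range(n_arrays)]
    return [[row[j] - medians[j] for j in range(n_arrays)] for row in log_ratios]

def center_genes(log_ratios):
    """Shift each gene (row) so that its mean log ratio across arrays is zero."""
    centered = []
    for row in log_ratios:
        row_mean = mean(row)
        centered.append([x - row_mean for x in row])
    return centered

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def sample_distance(log_ratios, j, k):
    """(1 - correlation) / 2 between arrays j and k."""
    col_j = [row[j] for row in log_ratios]
    col_k = [row[k] for row in log_ratios]
    return (1 - correlation(col_j, col_k)) / 2

# Toy data: 4 genes (rows) by 3 arrays (columns).
toy = [[1.0, 2.0, -1.0], [2.0, 4.0, -2.0], [3.0, 6.0, -3.0], [4.0, 8.0, -4.0]]
prepared = center_genes(normalize_arrays(toy))
```

Dividing by two puts the distance on a 0 to 1 scale: 0 for perfectly correlated arrays, 0.5 for uncorrelated ones, and 1 for perfectly anti-correlated ones.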
How well does a set of genes distinguish types of cancer?
• Three methods for assessment:
– Qualitative (PCA, MDS)
– Quantitative (PCA + ANOVA)
– Semi-quantitative (Grading Dendrograms)
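One way to read "PCA + ANOVA" is to project the samples onto their leading principal components and then test whether the component scores differ between cancer types. The slides do not state which components or which ANOVA design were used, so the sketch below is an assumption: it uses NumPy, treats rows as samples, and computes a plain one-way F statistic. The function names and toy data are hypothetical.

```python
import numpy as np

def principal_component_scores(data, n_components=2):
    """Project samples (rows) onto their leading principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def one_way_anova_f(values, labels):
    """F statistic for differences in mean between labeled groups."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (between / df_between) / (within / df_within)

# Toy data: 6 samples (rows) by 3 genes (columns), two cancer types.
toy = [[3.0, 3.0, 0.0], [3.1, 2.9, 0.1], [2.9, 3.1, -0.1],
       [0.0, 0.0, 0.1], [0.1, -0.1, 0.0], [-0.1, 0.1, -0.1]]
labels = ["colon"] * 3 + ["renal"] * 3

scores = principal_component_scores(toy)
f_first_component = one_way_anova_f(scores[:, 0], labels)
```

A large F statistic on a component means that component separates the cancer types well for the gene set in question.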
Multidimensional Scaling
[Figure: multidimensional scaling plot, coordinate 1 (about -0.2 to 0.2) against coordinate 2 (about -0.1 to 0.3); each cell line is plotted as the letter of its cancer type: B, C, L, M, N, O, P, R, S]
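A plot like this can be produced from a matrix of pairwise distances. The slides do not say which MDS variant was used; the sketch below assumes classical (Torgerson) scaling and uses NumPy. The function name is hypothetical.

```python
import numpy as np

def classical_mds(distances, n_dims=2):
    """Embed points in n_dims dimensions from a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to get an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives caused by rounding.
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy check: three points on a line at positions 0, 3, and 4.
toy_distances = [[0, 3, 4], [3, 0, 1], [4, 1, 0]]
coords = classical_mds(toy_distances, n_dims=2)
```

For the toy input the first coordinate recovers the spacing along the line (up to sign and shift) and the second coordinate is essentially zero.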
How good is a dendrogram?
• A = cluster contains all and only one kind of cancer
• B = all, with extras
• C = all except one
• D = all except one, with extras
• E = all except two
• F = all except two, with extras
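The grading scheme can be written as a small function. This is a sketch under two assumptions that the slide does not spell out: a dendrogram's grade for a cancer type is the best grade over all of its clusters, and grades rank from A (best) to F (worst). The sample names are toy placeholders.

```python
def grade_cluster(cluster, cancer, sample_types):
    """Grade one cluster for one cancer type on the A to F scale.

    cluster: set of sample names in the cluster
    sample_types: dict mapping every sample name to its cancer type
    Returns None when the cluster misses more than two samples of the
    type or contains none of them.
    """
    members = {s for s, t in sample_types.items() if t == cancer}
    if not (members & cluster):
        return None
    missing = len(members - cluster)
    if missing > 2:
        return None
    has_extras = bool(cluster - members)
    grades = [("A", "B"), ("C", "D"), ("E", "F")]
    return grades[missing][1 if has_extras else 0]

def best_grade(clusters, cancer, sample_types):
    """Best (alphabetically first) grade over all candidate clusters."""
    grades = [grade_cluster(c, cancer, sample_types) for c in clusters]
    return min((g for g in grades if g is not None), default=None)

# Toy example: three colon and two renal samples, three clusters.
sample_types = {"colon.1": "colon", "colon.2": "colon", "colon.3": "colon",
                "renal.1": "renal", "renal.2": "renal"}
clusters = [{"colon.1", "colon.2"},
            {"colon.1", "colon.2", "colon.3", "renal.1"},
            {"renal.1", "renal.2"}]

print(best_grade(clusters, "colon", sample_types))  # B
print(best_grade(clusters, "renal", sample_types))  # A
```

In practice the candidate clusters would be the leaf sets of every internal node of the dendrogram.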
[Figure: dendrogram of the cell lines; leaves are labeled by cancer type and cell line (for example breast.mcf7, colon.ht29, renal.a498); axis from 0.0 to 0.6]
Cancer B C L M N O P R S
Score A A D F D C B
Can cancers be distinguished by genes on one chromosome?
Dendrogram grades for each chromosome (ch); columns are cancers B C L M N O P R S. Empty cells are not preserved here, so each row lists its grades in column order.
ch  Grades           ch  Grades
1   B A D F D B      13  D E
2   E C D D E D E    14  A A F
3   C E D E F        15  C B C F C
4   E E E E          16
5   A A D F E        17  A A D F E E
6   C A D E E D      18  E D
7   E A D E C E      19  D D
8   E C D            20  E C
9   B C C E E E      21
10  D E              22  A E E
11  E C C D          X   B A D E D
12  B C C E E E
Heterogeneity of different types of cancer
• Some cancers (colon, leukemia) are fairly easy to distinguish from others
• Some (breast, lung) are so heterogeneous as to be almost impossible to distinguish
• Some chromosomes (1, 2, 6, 7, 9, 12, 17) can distinguish many cancers.
• Some (16, 21) are essentially random
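The chromosome lists in these bullets agree with a simple tally of the grade table. A minimal sketch, assuming "can distinguish many cancers" means that at least six cancer types receive a grade (this threshold is an assumption; the slide gives none). The grades are as read from the flattened table, without their column positions:

```python
# Grades per chromosome as read from the table above; each letter is
# one graded cancer type, and an empty string means no type was graded.
grades_by_chromosome = {
    "1": "BADFDB", "2": "ECDDEDE", "3": "CEDEF", "4": "EEEE",
    "5": "AADFE", "6": "CADEED", "7": "EADECE", "8": "ECD",
    "9": "BCCEEE", "10": "DE", "11": "ECCD", "12": "BCCEEE",
    "13": "DE", "14": "AAF", "15": "CBCFC", "16": "",
    "17": "AADFEE", "18": "ED", "19": "DD", "20": "EC",
    "21": "", "22": "AEE", "X": "BADED",
}

# Assumed threshold for "many": at least six graded cancer types.
many = [ch for ch, grades in grades_by_chromosome.items() if len(grades) >= 6]
none = [ch for ch, grades in grades_by_chromosome.items() if len(grades) == 0]

print(many)  # ['1', '2', '6', '7', '9', '12', '17']
print(none)  # ['16', '21']
```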
[Figure: Chromosome 2. Dendrogram of the cell lines, leaves labeled by cancer type and cell line]
[Figure: Chromosome 16. Dendrogram of the cell lines, leaves labeled by cancer type and cell line]
Can cancers be distinguished by genes of one function?
• Table for functional categories looks a lot like the table for chromosomes
• Some biological process categories (signal transduction, cell proliferation, cell cycle, protein metabolism) can distinguish many types of cancer
• Others (apoptosis, energy pathways) cannot
[Figure: cell surface receptor linked signal transduction. Dendrogram of the cell lines, leaves labeled by cancer type and cell line]
[Figure: protein metabolism and modification. Dendrogram of the cell lines, leaves labeled by cancer type and cell line]
[Figure: death (apoptosis). Dendrogram of the cell lines, leaves labeled by cancer type and cell line]
[Figure: energy pathways. Dendrogram of the cell lines, leaves labeled by cancer type and cell line]
Conclusions (I)
• Multiple views into the data provide substantial insight into differences in cancer types and gene sets.
• Cancer types differ greatly in their degree of heterogeneity, ranging from homogeneous (colon, leukemia) through moderately heterogeneous (renal, melanoma) to extremely heterogeneous (breast and lung).
Conclusions (II)
• Homogeneous cancers exhibit strong identifying signals across most views of the data.
• There are large differences in the ability of genes on different chromosomes, or involved in different biological processes, to distinguish cancer types.