University at Buffalo The State University of New York
Mining Phenotype Structures
Chun Tang and Aidong Zhang. Bioinformatics Journal, 20(6):829-838, 2004
Microarray Data Analysis
Analysis from two angles:
sample as object, gene as attribute
gene as object, sample/condition as attribute
Supervised Analysis
• Select training samples (hold out…)
• Sort genes (t-test, ranking…)
• Select informative genes (top 50 ~ 200)
• Cluster based on informative genes
[Figure: an idealized expression matrix over genes g1…g4132 — informative genes take high (1) values on the samples of Class 1 and low (0) values on those of Class 2, or vice versa.]
Unsupervised Analysis
We will focus on unsupervised sample partition, which assumes that no phenotype information is assigned to any sample.
Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns would be a significant contribution to microarray data analysis.
Many mature statistical methods cannot be applied unless the phenotypes of the samples are known in advance.
Unsupervised Analysis
[Figure: the same samples (1–10) and genes (gene1–gene7) rearranged — the informative genes reveal the sample partition, while the non-informative genes do not.]
An informative gene is a gene whose expression manifests the phenotype distinction among the samples.
Phenotype structure: sample partition + informative genes.
Automatic Phenotype Structure Mining
[Figure: mining transforms a gene expression matrix (genes gene1–gene3 over samples 1–7) into a result that exposes the phenotype distinction and the informative genes.]
Given an n × m data matrix M and the number K of sample phenotypes.
The goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes that manifests this phenotype distinction.
Requirements
The expression levels of each informative gene should be similar over the samples within each phenotype
The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes
Challenges (1)
The number of genes is very large while the number of samples is very limited, so no distinct class structure of the samples can be properly detected by existing techniques.
Challenges (2)
The limited informative genes are buried in a large amount of noise.
[Figure: of gene1–gene15, only gene5, gene9, and gene12 are informative.]
Challenges (3)
Gene PROTEASOME IOTA X59417
Gene C-myb U22376
Gene Fumarylacetoacetate M55150
Gene LTC4 synthase U50136
The values within the data matrices are all real numbers.
None of the informative genes follows an ideal "high-low" pattern.
Related Work
New tools using traditional methods: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
• SOM
• K-means
• Hierarchical clustering
• Graph-based clustering
• PCA
The similarity measures used in these methods are based on the full gene space.
Principal components (PCs) do not necessarily have a strong correlation with the informative genes.
Related Work (Cont’d)
Clustering with feature selection (CLIFF, two-way ordering, SamCluster):
1. Filter the invariant genes
   • Rank variance • PCA • CV
2. Partition the samples
   • Ncut, Min-Max Cut • Hierarchical clustering
3. Prune genes based on the partition
   • Markov blanket filter • T-test
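The first step of such pipelines, filtering invariant genes by variance rank, can be sketched in Python. This is a generic illustration, not code from CLIFF or SamCluster; the function name `filter_invariant_genes` and the `keep_fraction` knob are invented for this example.

```python
import numpy as np

def filter_invariant_genes(matrix, keep_fraction=0.5):
    """Keep only the most variable genes (rows); a generic
    rank-variance filter, with `keep_fraction` a made-up knob."""
    variances = matrix.var(axis=1)                 # per-gene variance across samples
    k = max(1, int(len(variances) * keep_fraction))
    keep = np.argsort(variances)[::-1][:k]         # k most variable genes
    return np.sort(keep)

# gene 1 is flat (invariant) and gets filtered out:
m = np.array([[1.0, 5.0, 9.0],
              [2.0, 2.0, 2.0],
              [0.0, 4.0, 8.0]])
kept = filter_invariant_genes(m, keep_fraction=0.67)
```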
Related Work (Cont’d)
Subspace clustering:
• Bi-clustering
• δ-clustering
Related Work (Cont’d)
Subspace clustering only measures trend similarity. In our model, however, we require each gene to show consistent signals on the samples of the same phenotype.
Related Work (Cont’d)
Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want the genes that can differentiate all phenotypes.
Our Contributions
We cast the phenotype structure mining problem as an optimization problem.
A series of statistics-based metrics are defined as objective functions.
A heuristic search method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.
Model - Measurements
[Figure: a genes-by-samples submatrix (gene1–gene3 over sample groups S1 and S2, gene subset G') illustrating intra-consistency within each phenotype, inter-divergence between phenotypes, and the resulting phenotype quality.]
Intra-consistency

Measurement   Data(A)    Data(B)
residue       0.1975     0.4506
MSR           0.0494     0.4012
Ours          339.0667   5.3000

Data(A) is NOT consistent; Data(B) is consistent.
Intra-pattern-consistency (Cont’d)
Variance of a single gene on the samples within one phenotype:

$$\mathrm{Var}(i, S') = \frac{1}{|S'|} \sum_{s_j \in S'} \left( w_{i,j} - \bar{w}_{i,S'} \right)^2$$

Intra-pattern-consistency: average row variance:

$$\mathrm{Con}(G', S') = \frac{1}{|G'|} \sum_{g_i \in G'} \mathrm{Var}(i, S') = \frac{1}{|G'|\,|S'|} \sum_{g_i \in G'} \sum_{s_j \in S'} \left( w_{i,j} - \bar{w}_{i,S'} \right)^2$$

where $w_{i,j}$ is the expression of gene $g_i$ in sample $s_j$ and $\bar{w}_{i,S'}$ is the mean expression of $g_i$ over $S'$.
In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
Con is the average variance over the gene subset: the smaller its value, the better the intra-phenotype consistency.
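As a concrete sketch, Con can be computed as follows (illustrative Python, not the authors' implementation; `expr` is a genes-by-samples NumPy array and the index lists pick out G' and S'):

```python
import numpy as np

def con(expr, genes, phenotype):
    """Intra-pattern-consistency Con(G', S'): the average, over the
    candidate informative genes G', of each gene's variance across
    the samples S' of one phenotype. Smaller means more consistent."""
    block = expr[np.ix_(genes, phenotype)]   # the |G'| x |S'| sub-matrix
    return float(block.var(axis=1).mean())   # mean of the per-gene variances

expr = np.array([[1.0, 1.0, 5.0, 5.0],
                 [2.0, 2.0, 8.0, 8.0]])
con(expr, genes=[0, 1], phenotype=[0, 1])    # perfectly consistent: 0.0
```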
Inter-pattern-divergence
Both "intra-pattern-consistency" and "inter-pattern-divergence" on the same gene are reflected.
Average block distance:

$$\mathrm{Div}(G', S_1, S_2) = \frac{1}{|G'|} \sum_{g_i \in G'} \left| \bar{w}_{i,S_1} - \bar{w}_{i,S_2} \right|$$
How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?
Div is the average difference between the phenotypes over the gene subset: the larger its value, the better the inter-phenotype divergence.
Pattern Quality
The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large.
$$\Omega(G', S_1, \ldots, S_K) = \sum_{1 \le i < j \le K} \frac{\mathrm{Div}(G', S_i, S_j)}{\sqrt{\mathrm{Con}(G', S_i) + \mathrm{Con}(G', S_j)}}$$
The higher the value, the better the quality.
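A runnable sketch that puts the three measures together (illustrative only; the function names are ours, and `expr` is a small genes-by-samples array):

```python
import numpy as np
from itertools import combinations

def con(expr, genes, S):
    """Con(G', S'): average per-gene variance within one phenotype."""
    return float(expr[np.ix_(genes, S)].var(axis=1).mean())

def div(expr, genes, S1, S2):
    """Div(G', S1, S2): average |difference of phenotype means| per gene."""
    b1 = expr[np.ix_(genes, S1)].mean(axis=1)
    b2 = expr[np.ix_(genes, S2)].mean(axis=1)
    return float(np.abs(b1 - b2).mean())

def quality(expr, genes, groups):
    """Phenotype quality: sum over phenotype pairs of
    Div / sqrt(Con_i + Con_j); higher is better."""
    return sum(div(expr, genes, a, b) /
               np.sqrt(con(expr, genes, a) + con(expr, genes, b))
               for a, b in combinations(groups, 2))

expr = np.array([[1.0, 1.2, 5.0, 5.2],
                 [2.0, 2.2, 8.0, 8.2]])
q = quality(expr, genes=[0, 1], groups=[[0, 1], [2, 3]])
# Div ≈ 5.0, Con ≈ 0.01 per group, so q ≈ 5 / sqrt(0.02) ≈ 35.36
```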
Measurements

Inter-divergence:

$$\mathrm{Div}(G', S_1, S_2) = \frac{1}{|G'|} \sum_{g_i \in G'} \left| \bar{w}_{i,S_1} - \bar{w}_{i,S_2} \right|$$

Phenotype quality:

$$\Omega(G', S_1, \ldots, S_K) = \sum_{1 \le i < j \le K} \frac{\mathrm{Div}(G', S_i, S_j)}{\sqrt{\mathrm{Con}(G', S_i) + \mathrm{Con}(G', S_j)}}$$

Intra-consistency:

$$\mathrm{Con}(G', S') = \frac{1}{|G'|\,|S'|} \sum_{g_i \in G'} \sum_{s_j \in S'} \left( w_{i,j} - \bar{w}_{i,S'} \right)^2$$
Phenotype Quality
        Data(A)   Data(B)   Data(C)
Con     4.25      3.44      4.52
Div     41.60     25.20     46.16
Ω       14.2687   9.6074    15.3526

Data(C) has the highest phenotype quality.
Model - Formalized Problem
Input
m samples and n genes
the corresponding gene expression matrix M
the number of phenotypes K
Output
A K-partition of samples (phenotypes) and a subset of genes (informative space) such that the phenotype quality is maximized.
Strategy
Maintain a candidate phenotype structure and iteratively adjust the candidate structure toward the optimal solution.
Basic elements:
A candidate structure:
• A partition of samples {S1, S2, …, SK}
• A subset of genes G' ⊆ G
• The corresponding phenotype quality Ω
An adjustment:
• For a gene gi ∉ G', insert it into G'
• For a gene gi ∈ G', remove it from G'
• For a sample si in a group S', move it to another group
The quality gain measures the change of phenotype quality before and after the adjustment.
Heuristic Searching
[Flow diagram: generate a candidate structure → pick an object (gene/sample) → try an adjustment on the intermediate candidate structure; if the quality gain ΔΩ > 0, accept it (Y); otherwise (N) accept it with probability p = exp(ΔΩ / T(i)); iterate.]
Heuristic Searching
• Start with a random K-partition of samples and a subset of genes as the candidate informative space.
• Iteratively adjust the partition and the gene set toward a better solution, visiting genes and samples in random order: for each gene, try a possible insert/remove; for each sample, try the best movement.
Adjustments: insert a gene, remove a gene, move a sample.
Heuristic Search
For each possible adjustment, compute the quality gain ΔΩ:
• For each gene, try a possible insert/remove
• For each sample, try the best movement
If ΔΩ > 0, conduct the adjustment; if ΔΩ < 0, conduct the adjustment with probability

$$p = \exp\!\left(\frac{\Delta\Omega}{T(i)}\right)$$

where T(i) is a decreasing simulated-annealing function and i is the iteration number; T(0) = 1 and T(i) = 1/(i+1) in our implementation.
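The acceptance rule can be sketched as follows (a toy illustration; the function name and RNG handling are our own choices, not the paper's code):

```python
import math
import random

def accept(quality_gain, iteration, rng=None):
    """Accept an adjustment outright if its quality gain is positive;
    otherwise accept with probability p = exp(gain / T(i)), using the
    cooling schedule T(0) = 1, T(i) = 1/(i + 1)."""
    if quality_gain > 0:
        return True
    T = 1.0 / (iteration + 1)
    rng = rng or random.Random(0)
    return rng.random() < math.exp(quality_gain / T)

# Positive gains are always taken; a strongly negative gain at a late
# iteration (low temperature) is essentially never taken.
```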
Mutual Reinforcing Adjustment - Motivation
Drawbacks of the heuristic search method: blind initialization, equal chance given to samples and genes, noisy samples.
The phenotype quality of a subset of the informative genes and a partial phenotype structure should also be high.
Since mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, we start from small groups whose data distributions and patterns are much easier to detect.
The mining of phenotypes and the mining of informative genes should mutually reinforce each other.
Mutual Reinforcing Adjustment - Major Steps
• Partition the Matrix: divide the original matrix into a series of exclusive sub-matrices by partitioning both the samples and the genes.
• Reference Partition Detection: posit a partial or approximate phenotype structure called a reference partition of samples:
  - compute the reference degree of each sample group;
  - select K groups of samples;
  - perform partition adjustment.
• Gene Adjustment: adjust the candidate informative genes:
  - compute the phenotype quality for the reference partition on G';
  - perform the possible adjustments for each gene.
• Refinement Phase
Method Detail - Iteration Phase
[Flow diagram: (informative genes G' × all samples) → partitioning the matrix → reference partition detection → reference partition → gene adjustment → (informative genes G'' × all samples) → to the next iteration.]
Partitioning the Matrix
• Partition the samples and genes into multiple groups using CAST.
• A threshold t decides the size of each group, based on Pearson's correlation coefficient.
• Outliers are filtered out of every group.
• Samples or genes in the same group share similar patterns.

Pearson's correlation coefficient:

$$\rho_{X,Y} = \frac{\sum_{i=1}^{k} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{k} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{k} (y_i - \bar{y})^2}}$$
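Pearson's correlation coefficient in minimal Python (an illustration of the formula above, not the CAST implementation itself):

```python
import math

def pearson(x, y):
    """Pearson's correlation between two equal-length profiles."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x)) *
           math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den

pearson([1, 2, 3], [10, 20, 30])   # same trend: correlation ~ 1.0
pearson([1, 2, 3], [3, 2, 1])      # opposite trend: correlation ~ -1.0
```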
Reference Partition Detection
Select the groups of samples as potential phenotypes
Pick the first group with the highest reference degree
Select the other groups by considering the inter-phenotype divergence w.r.t. selected groups
$$\mathrm{ref}(S_j) = \sum_{G_i \subseteq G'} \log \frac{1}{\mathrm{Con}(G_i, S_j)}$$

$$\mathrm{Ran}(S_x) = \sum_{G_i \subseteq G'} \log \frac{\mathrm{Div}(G_i, S_x, S_{p_0}) \cdots \mathrm{Div}(G_i, S_x, S_{p_t})}{\mathrm{Con}(G_i, S_x)}$$

where $S_{p_0}, \ldots, S_{p_t}$ are the groups already selected.
Check the Missing Samples
Probabilistically insert the remaining samples not in the selected groups into their most probable matching groups.
In later iterations, use the candidate gene sets to improve the reference partition.
Gene Adjustment
Gene adjustment: test the possible adjustments that lead to an improvement:
• Insert a gene
• Remove a gene
Method - Refinement Phase
The partition corresponding to the best state may not cover all the samples.
Add every sample not covered by the reference partition into its matching group → the phenotypes of the samples.
Then a gene adjustment phase is conducted; we execute all adjustments with a positive quality gain → the informative space.
Time complexity: O(n·m²·I)
Mining Multiple Phenotype Structures
[Figure: one data set can contain more than one phenotype structure — an empirical phenotype structure and a hidden phenotype structure, each supported by its own subset of genes (gene1–gene9) over samples 1–10.]
Extended Algorithm Strategy
Maintain p candidate phenotype structures and iteratively adjust them toward the optimal solution.
Basic elements of each candidate structure:
• A Kt-partition of samples
• A subset of genes Gt ⊆ G
• The corresponding phenotype quality Ωt
An adjustment:
• For a gene gi ∉ Gt, insert it into Gt
• For a gene gi ∈ Gt, move it to Gt' (t ≠ t') or remove it from all structures
• For a sample si in group S', move it to another group
The quality gain measures the change of pattern quality of the states before and after the adjustment.
The Extended Algorithm (Cont’d)
[Figure: gene adjustments (insert, move, remove) and sample adjustments (move) applied across candidate structures 1 and 2.]
Mining Multiple Phenotype Structures (Cont’d)
Partially informative genes
Formalized Problem
Input
•m samples and n genes
•the corresponding gene expression matrix M
•the number of phenotype structures p
•the set of numbers {K1, K2, …, Kp}
Output
p phenotype structures, where the tth structure is a Kt-partition of samples (phenotypes) and a subset of genes (informative space) which manifests the sample partition, such that the overall phenotype quality is maximized.
The Algorithm
Candidate Structure Generation:
• cluster the genes into p' groups (p' > p) using CAST;
• generate sample partitions one by one on the clusters of genes, selecting the genes with the best quality.
Iterative Adjustment:
• for each gene, try a possible insert/move/remove;
• for each sample, examine all possible adjustments and select the best movement.
The Algorithm (Cont’d)
[Figure: each gene has p possible adjustments (insert, move, remove); each sample has Kt − 1 possible adjustments within each partition.]
The Algorithm (Cont’d)
Data Standardization: transform the original gene intensity values into relative values:

$$w'_{i,j} = \frac{w_{i,j} - \bar{w}_i}{\sigma_i}, \qquad \text{where } \bar{w}_i = \frac{1}{m} \sum_{j=1}^{m} w_{i,j} \text{ and } \sigma_i = \sqrt{\frac{1}{m} \sum_{j=1}^{m} (w_{i,j} - \bar{w}_i)^2}.$$

• Random order of genes and samples.
• Conduct a negative action with a probability given by the simulated annealing technique:

$$p = \exp\!\left(\frac{\Delta\Omega}{T(i)}\right), \qquad T(0) = 1, \quad T(i) = \frac{1}{i+1}.$$
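A sketch of the per-gene standardization (illustrative; NumPy's default `std` uses the divisor m, matching the population form of the formula above):

```python
import numpy as np

def standardize(matrix):
    """Map each gene's (row's) intensities to relative values:
    w'_ij = (w_ij - mean_i) / sigma_i, with mean and sigma taken
    over the m samples of that gene (population std, divisor m)."""
    mean = matrix.mean(axis=1, keepdims=True)
    std = matrix.std(axis=1, keepdims=True)
    return (matrix - mean) / std

m = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
z = standardize(m)   # each row now has mean 0 and (population) std 1
```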
Experiments
Data Sets:
• Multiple sclerosis data
  - MS-IFN: 4132 × 28 (14 MS vs. 14 IFN)
  - MS-CON: 4132 × 30 (15 MS vs. 15 Control)
• Leukemia data
  - 7129 × 38 (27 ALL vs. 11 AML)
  - 7129 × 34 (20 ALL vs. 14 AML)
• Colon cancer data: 2000 × 62 (22 normal vs. 40 tumor colon tissues)
• Hereditary breast cancer data: 3226 × 22 (7 BRCA1, 8 BRCA2, 7 Sporadic)
Rand Index
Rand Index: a measurement of "agreement" between the ground truth (P) and the results (Q):
• a: the number of pairs of objects that are in the same class in P and in the same class in Q;
• b: the number of pairs of objects that are in the same class in P but not in the same class in Q;
• c: the number of pairs of objects that are in the same class in Q but not in the same class in P;
• d: the number of pairs of objects that are in different classes in P and in different classes in Q.

$$RI = \frac{a + d}{a + b + c + d}$$

[Figure: example partitions P and Q over samples s1 and s2 illustrating the four pair categories.]
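The Rand index in a few lines of Python (illustrative; modern libraries such as scikit-learn provide an equivalent `rand_score`):

```python
from itertools import combinations

def rand_index(P, Q):
    """RI = (a + d) / (a + b + c + d): the fraction of object pairs
    on which the two labelings P and Q agree (both 'same class' or
    both 'different class')."""
    pairs = list(combinations(range(len(P)), 2))
    agree = sum((P[i] == P[j]) == (Q[i] == Q[j]) for i, j in pairs)
    return agree / len(pairs)

rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # identical partitions: 1.0
```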
Phenotype Structure Detection
Data Set     MS-IFN    MS-CON    Leukemia-G1  Leukemia-G2  Colon     Breast
Data Size    4132*28   4132*30   7129*38      7129*34      2000*62   3226*22
J-Express    0.4815    0.4851    0.5092       0.4965       0.4939    0.4112
CLUTO        0.4815    0.4828    0.5775       0.4866       0.4966    0.6364
CIT          0.4841    0.4851    0.6586       0.4920       0.4966    0.5844
CNIO         0.4815    0.4920    0.6017       0.4920       0.4939    0.4112
CLUSFAVOR    0.5238    0.5402    0.5092       0.4920       0.4939    0.5844
δ-cluster    0.4894    0.4851    0.5007       0.4538       0.4796    0.4719
Heuristic    0.8052    0.6230    0.9761       0.7086       0.6293    0.8638
Mutual       0.8387    0.6513    0.9778       0.7558       0.6827    0.8749
Experiments

             Number of iterations       Running time (s)
Data Size    mean     std. dev.         mean     std. dev.
4132*28      158      27.2              180      35.1
4132*30      168      29.5              195      37.8
7129*38      171      16.1              436      51.9
7129*34      198      35.9              458      101.2
2000*62      133      17.8              479      98.5
3226*22      157      22.2              167      35.6

The mean value and standard deviation of the number of iterations and the response time (in seconds) with respect to the matrix size.
Experimental Results (5)
Phenotype Structure Detection (Cont’d)
The mutual reinforcing approach as applied to the MS-IFN group:
(A) shows the distribution of the original 28 samples; each point represents a sample with 4132 genes mapped to two-dimensional space.
(B) shows the distribution in the middle of the adjustment.
(C) shows the distribution of the same 28 samples after the iterations; 76 genes were selected as the informative space.
Experimental Results (5)
Informative Gene Selection
Phenotype Structures
Experimental Results (5)
Informative Gene Selection (Cont’d)
Experimental Results (5)
Scalability Evaluation
Conclusion from the Experiments
The work is motivated by the needs of emerging microarray data analysis.
The strategy is designed for data with the following properties:
• The number of samples is limited but the gene dimension is very large.
• Large volumes of irrelevant and redundant genes prevent accurate grouping of samples.
• Analyzing objects along one dimension can enhance the detection of meaningful patterns along the other dimension.