Nonstandard Machine Learning Algorithms for
Microarray Data Mining
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/
Outline
- Microarray Bioinformatics
- Probabilistic Graphical Models for Gene Expression Analysis
- Gene-Drug Dependency Analysis with Bayesian Networks
- Gene and Tissue Clustering with Latent Variable Models
- Summary and Future Work
Microarray Bioinformatics
Molecular Biology: Central Dogma
Reductionistic and Synthetic Approaches in Biology
<Diagram: Biological system (organism) ↔ building blocks (genes/molecules). The reductionistic approach (experiments) moves from the organism down to its building blocks; the synthetic approach (bioinformatics) moves from the building blocks back up to the system.>
Topics in Bioinformatics
- Structure analysis: protein structure comparison, protein structure prediction, RNA structure modeling
- Pathway analysis: metabolic pathways, regulatory networks
- Sequence analysis: sequence alignment, structure and function prediction, gene finding
- Expression analysis: gene expression analysis, gene clustering
Applications of DNA Microarrays
- Gene discovery
- Analysis of gene regulation
- Disease diagnosis
- Drug discovery: pharmacogenomics
- Toxicological research: toxicogenomics
A Comparative Hybridization Experiment
<Figure: comparative hybridization workflow ending in image analysis>
Image Analysis
- Scanned images → probe intensities → numerical values for higher-level analysis
<Pipeline: array target segmentation → target detection → background intensity extraction → target intensity extraction → ratio analysis>
Data Preparation for Data Mining
<Figure: microarray image samples (Sample 1, Sample 2, …, Sample n) are converted by image analysis into numerical data (genes × samples) for data mining>
Gene Expression Data Mining
Data preprocessing:
- Normalization
- Discretization
- Gene selection
Learning:
- Greedy search
- EM algorithm
- …
Mining:
- Classification
- Clustering
- Regulation analysis
Why Data Mining?
- Traditional analysis
  - One gene in one experiment
  - Small data sets
  - Simple analytical methods
- High-throughput genomic analysis
  - Simultaneous measurement of thousands of gene expression levels → massive data sets
  - Statistical methods
  - Machine learning approaches
An Example of Data Mining: Clustering
Analysis of DNA Microarray Data: Previous Work
- Characteristics of data
  - Analysis of expression ratios based on each sample
  - Analysis of time-variant data
- Clustering
  - Self-organizing maps [Golub et al., 1999]
  - Singular value decomposition [Alter et al., 2000]
- Classification
  - Support vector machines [Brown et al., 2000]
- Gene identification
  - Information theory [Stefanie et al., 2000]
- Gene modeling
  - Bayesian networks [Friedman et al., 2000]
Probabilistic Graphical Models for Gene Expression Analysis
An Introductory Example
- A Bayesian network classifier for acute leukemias [Hwang et al. 2001]
<Network structure: a Bayesian network over five variables>
- MB-1: MB-1 gene
- C-myb: C-myb gene, extracted from the human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds
- LTC4S: leukotriene C4 synthase (LTC4S) gene
- Zyxin: Zyxin gene
- Leukemia: ALL or AML
$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}(X_i))$
Probabilistic Graphical Models
- The joint probability distribution over $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$
  - Chain rule: $P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$
- Conditional independence $X \perp Y \mid Z$: $P(X \mid Y, Z) = P(X \mid Z)$
- Efficient representation of the joint probability distribution using conditional independencies encoded by the graph (network) structure
  - $X_i \perp \{X_1, \ldots, X_{i-1}\} \setminus V(X_i) \mid V(X_i)$
  - $P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}) = \prod_{i=1}^{n} P(X_i \mid V(X_i))$
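To make the factorization concrete, here is a small illustrative sketch (not from the slides; the three-node network, parent sets, and CPT values are invented) of computing a joint probability from conditional probability tables:

```python
# Illustrative sketch: the joint probability of a tiny Bayesian network as the
# product of conditional probabilities P(X_i | Pa(X_i)). All values are made up.
from itertools import product

# Each CPT maps a tuple of parent values to P(X = 1 | parents).
cpts = {
    "A": {(): 0.3},                  # A has no parents
    "B": {(0,): 0.9, (1,): 0.2},     # parent: A
    "C": {(0,): 0.4, (1,): 0.7},     # parent: B
}
parents = {"A": [], "B": ["A"], "C": ["B"]}
order = ["A", "B", "C"]              # a topological order of the DAG

def joint(assignment):
    """P(X) = prod_i P(X_i | Pa(X_i)) for a full 0/1 assignment dict."""
    p = 1.0
    for var in order:
        pa = tuple(assignment[q] for q in parents[var])
        p1 = cpts[var][pa]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(dict(zip(order, vals))) for vals in product([0, 1], repeat=3))
print(round(total, 10))
```
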
Analysis Procedure
Gene expression data → discretization and gene selection → learning graphical models
<Figure: selected genes (Gene A, Gene B, Gene C, Gene D) and the target variable → learned graphical model>
Uses of the learned model:
- Classification
- Dependency analysis
- Clustering
Why Probabilistic Graphical Models?
- The joint probability distribution → all the knowledge about the (biological) system
- Generative → modeling of data sources
- Efficient probabilistic inference → prediction
- The network structure → insight into the intricate relationships between components (i.e., genes) of the biological system → dependency analysis
- Robust to noise and error in gene expression data
Classes of Graphical Models
Graphical Models
- Undirected
  - Boltzmann machines
  - Markov random fields
- Directed
  - Bayesian networks
  - Latent variable models
  - Hidden Markov models
  - Generative topographic mapping
  - Non-negative matrix factorization
Gene-Drug Dependency Analysis with Bayesian Networks
NCI Drug Discovery Program
NCI 60 cell lines data set
NCI 60 Cell Lines Data Set
- 60 human cancer cell lines
  - Cancers of colorectal, renal, ovarian, breast, prostate, lung, and central nervous system origin, as well as leukemias and melanomas.
- Drug activity patterns (Database A)
  - Sulphorhodamine B assay → changes in total cellular protein after 48 hours of drug treatment.
- Individual targets (matrix Ti)
  - Analysis of molecular characteristics other than mRNA expression.
- This study focuses on
  - Sensitivity to therapy, not molecular consequences of therapy.
Gene-Drug Dependency Analysis Using Bayesian Networks
- To discover
  - Gene-gene expression dependency
  - Gene expression-drug activity dependency
  - Drug-drug activity dependency
<Figure: example network over Gene A, Gene B, Drug A, Drug B, and Drug C, with a local probability distribution attached to each node>
Bayesian Networks: the Qualitative Part
- The directed acyclic graph (DAG) structure
  - Conditional independencies between variables → dependency analysis
<The DAG structure: a network over Gene A, Gene B, Gene C, Gene D, Gene E>
<Conditional independencies>
- Gene B and Gene D are independent given Gene A.
- Gene B asserts dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.
Bayesian Networks: the Quantitative Part
- The joint probability distribution over all the variables in a Bayesian network:
$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}(X_i))$
For the DAG over Gene A, …, Gene E shown before:
$P(A, B, C, D, E) = P(A)\,P(E \mid A)\,P(B \mid A, E)\,P(D \mid A, E, B)\,P(C \mid A, E, B, D)$
$= P(A)\,P(E)\,P(B \mid A, E)\,P(D \mid A)\,P(C \mid B)$
- Local probability distribution for $X_i$:
  - $\Theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijr_i})$: parameters for $P(X_i \mid \mathbf{Pa}(X_i) = j)$
  - $p(\Theta_{ij}) = \mathrm{Dir}(\theta_{ij1}, \ldots, \theta_{ijr_i} \mid \alpha_{ij1}, \ldots, \alpha_{ijr_i})$
  - $q_i$: number of configurations of $\mathbf{Pa}(X_i)$; $r_i$: number of states of $X_i$
Applications of Bayesian Networks
- Classification
  - Bayesian network classifiers
  - Probabilistic inference → prediction → cancer classification based on gene expression
- Dependency analysis
  - Bayesian network structure → dependency between components → putative causal relationships
Discretization of Gene Expression Levels
- The local probability distribution for each variable
  - Multinomial distribution with a Dirichlet prior:
$p(\theta_{ij} \mid S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijr_i}) = \frac{\Gamma(\alpha_{ij1} + \cdots + \alpha_{ijr_i})}{\Gamma(\alpha_{ij1}) \cdots \Gamma(\alpha_{ijr_i})} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$
where $\Gamma(x + 1) = x\,\Gamma(x)$, $\Gamma(1) = 1$, and $S^h$ is the network structure.
- Discretization of gene expression levels and drug activities
  - All values → 0 and 1 according to the mean values across the training samples.
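The binarization step can be sketched as follows (an illustrative example with made-up expression values; the slide specifies only thresholding at the per-gene mean across the training samples):

```python
# Sketch of mean-based discretization: expression values above a gene's mean
# across the training samples become 1, the rest become 0. Data are invented.
import numpy as np

expr = np.array([[2.1, 0.5, 3.3],
                 [1.9, 1.5, 0.2],
                 [0.0, 1.0, 2.5],
                 [4.0, 1.0, 2.0]])      # 4 training samples x 3 genes

means = expr.mean(axis=0)               # per-gene mean over training samples
binary = (expr > means).astype(int)     # 1 above the mean, 0 otherwise
print(binary)
```
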
Selection of Genes and Drugs for Analysis
<Flow: known drugs → elimination of missing values → 566 human genes and ESTs, 82 drugs → learning a Bayesian network with 648 nodes>
Learn the Bayesian Network from Data
- Two approaches to Bayesian network learning
  - Dependency analysis-based approach
  - Optimization-based approach
    - Network score: fitness of the network to the training data D.
    - Search for the best-scoring network structure.
- Scoring metrics for the network
  - MDL (minimum description length) score
  - BD (Bayesian Dirichlet) score
BD (Bayesian Dirichlet) Score
- Assumptions
  - Multinomial sample
  - Parameter independence and modularity
  - Dirichlet prior: $p(\theta_{ij} \mid S^h) = c \cdot \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$
  - Complete data
- The BD score is calculated as follows:
$p(D, S) = p(S) \cdot p(D \mid S) = p(S) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$
  - $S$: the network structure; $D$: the training data
  - $p(S)$: prior probability over structures
  - $N_{ijk}$: sufficient statistics calculated from $D$, with $N_{ij} = \sum_k N_{ijk}$ and $\alpha_{ij} = \sum_k \alpha_{ijk}$
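A minimal sketch of evaluating the (log) BD term $p(D \mid S)$ for a candidate structure, assuming binary variables and uniform hyperparameters $\alpha_{ijk} = 1$; the tiny data set and structure are invented for illustration:

```python
# Log BD marginal likelihood log p(D | S) for discrete data, computed term by
# term from the Gamma-function formula above. Toy data/structure, alpha_ijk = 1.
import math
from itertools import product

data = [  # rows: samples; columns: X0, X1 (binary)
    (0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0),
]
parents = {0: [], 1: [0]}   # candidate structure: X0 -> X1
r = 2                       # every variable has 2 states

def log_bd(data, parents, alpha=1.0):
    score = 0.0
    for i, pa in parents.items():
        # one Dirichlet-multinomial term per parent configuration j
        for j in product(range(r), repeat=len(pa)):
            counts = [0] * r                       # N_ijk
            for row in data:
                if tuple(row[p] for p in pa) == j:
                    counts[row[i]] += 1
            a_ij, n_ij = alpha * r, sum(counts)
            score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
            for n_ijk in counts:
                score += math.lgamma(alpha + n_ijk) - math.lgamma(alpha)
    return score  # log p(D | S); add log p(S) for the full log BD score

print(log_bd(data, parents))
```
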
Search Strategy
- Finding the best network structure → an NP-hard problem.
  - Exponential time complexity.
  - In general, greedy search or simulated annealing is used.
- A general greedy search algorithm is not applicable to the construction of large-scale Bayesian networks with hundreds of nodes.
  - Time and space complexity.
- Reduce the search space
  - “Sparse candidate algorithm” [Friedman et al. 1999]
  - “Local to global search algorithm”
    - Exploit a local search for the Markov blanket of each node to reduce the global search space.
Local to Global Search Algorithm
<Algorithm: a local search first identifies a candidate parent set (Markov blanket) for each node; a greedy global search restricted to these candidates then constructs the network.>
The network score decomposes over the nodes:
$\mathrm{Score}(B, D) = \sum_i \mathrm{Score}(X_i \mid \mathbf{Pa}_B(X_i), D)$
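The decomposable score makes greedy structure search cheap: each candidate edge changes only one local term. Below is a hedged sketch (plain greedy edge addition, not the sparse-candidate or local-to-global algorithm itself) using a BIC-style local score as a stand-in for the BD score, on invented binary data:

```python
# Greedy, score-based structure search exploiting decomposability:
# Score(B, D) = sum_i Score(X_i | Pa_B(X_i), D). BIC-style stand-in score.
import math
from itertools import product

def local_score(i, pa, data, r=2):
    """Log-likelihood of X_i given parents pa, minus a BIC penalty."""
    score = 0.0
    for j in product(range(r), repeat=len(pa)):
        counts = [0] * r
        for row in data:
            if tuple(row[p] for p in pa) == j:
                counts[row[i]] += 1
        n_ij = sum(counts)
        for c in counts:
            if c:
                score += c * math.log(c / n_ij)
    penalty = 0.5 * math.log(len(data)) * (r - 1) * r ** len(pa)
    return score - penalty

def creates_cycle(parents, child, new_parent):
    """Would the edge new_parent -> child close a directed cycle?"""
    stack = [new_parent]
    while stack:
        v = stack.pop()
        if v == child:
            return True
        stack.extend(parents[v])
    return False

def greedy_search(data, n_vars):
    parents = {i: [] for i in range(n_vars)}
    while True:
        best, best_gain = None, 1e-9
        for c in range(n_vars):
            base = local_score(c, parents[c], data)
            for p in range(n_vars):
                if p == c or p in parents[c] or creates_cycle(parents, c, p):
                    continue
                gain = local_score(c, parents[c] + [p], data) - base
                if gain > best_gain:
                    best, best_gain = (p, c), gain
        if best is None:
            return parents
        parents[best[1]].append(best[0])   # add the best-scoring edge

# X1 roughly copies X0, so the search should recover one edge between them.
data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1), (1, 0), (0, 0)]
print(greedy_search(data, 2))
```
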
Learning Curve
<Plot: BD score vs. number of greedy search steps>
Learning takes about 8 minutes.
Drug-Drug Dependency
- The pyrimidine analogues (DNA synthesis inhibitors)
  - Aphidicolin-glycinate
  - Floxuridine
  - Cyclocytidine
  - Cytarabine
- These drugs show similar activity patterns across the 60 cancer cell lines.
- They were clustered together in [Scherf et al. 2000].
Experimental Results: Drug-Drug Dependency
- Drug-drug activity correlations
<Part of the learned Bayesian network structure>
- The three drugs “Aphidicolin-glycinate”, “Floxuridine”, and “Cytarabine” directly depend on each other.
- “Cyclocytidine” directly depends on “Cytarabine” and vice versa.
→ Confirmation of known relationships
Gene-Drug Dependency
- Gene “ASNS”
  - Certain malignant cells, including many acute lymphoblastic leukemias (ALL), lack asparagine synthetase (ASNS) → dependence on exogenous “L-asparagine” [Scherf et al. 2000].
- High negative correlation between the expression levels of the gene “ASNS” and the sensitivity to the drug “L-asparagine”.
Experimental Results:Gene-Drug Dependency
- Gene expression-drug activity correlations
<Part of the learned Bayesian network structure>
- The known negative correlation between “ASNS” and “L-asparagine” appears in the network → confirmation; it is mediated by two other genes → discovery.
- The relationships revealed by the Bayesian network are putative and should be verified by biological experiments → exploratory analysis.
Gene and Tissue Clustering with Latent Variable Models
CAMDA-2000 Data Set
- Gene expression data for cancer prediction
  - Training data: 38 leukemia samples (27 ALL, 11 AML)
  - Test data: 34 leukemia samples (20 ALL, 14 AML)
  - The datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.
Cluster Analysis Using Non-negative Matrix Factorization
- Method
  - Use NMF for class clustering and prediction on gene expression data from acute leukemia patients.
- NMF (non-negative matrix factorization)
$G \approx WH, \qquad G_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}, \qquad G_{i\mu}, W_{ia}, H_{a\mu} \ge 0$
  - G: gene expression data matrix
  - W: basis matrix (prototypes)
  - H: encoding matrix (in low dimension)
- NMF as a latent variable model
<Figure: latent variables h_1, h_2, …, h_r generate the observed g_1, g_2, …, g_n through the weights W; g = ⟨h⟩W>
Clustering Gene Expression Data
<Figure: the data matrix G (7,129 genes × 38 samples) is factorized as G ≈ WH, with W(?) (7,129 genes × 2 factors) and the encoding H(?) (2 factors × 38 samples)>
- Factors can capture the correlations between the genes using the values of expression level.
- Cluster the training samples into 2 groups by NMF
  - Assign each sample to the factor (class) with the higher encoding value (row H_1· vs. H_2·).
Learning Procedure
Input: gene expression data matrix G (n × m)
Output: basis matrix W (n × k), encoding matrix H (k × m)
n: data size, m: number of genes, k: number of latent variables
Objective function:
$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$
Procedure
1. Initialize W, H with random numbers.
2. Update W, H iteratively until max_iteration or some criterion is met:
$H_{a\mu} \leftarrow H_{a\mu} \sum_i W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}, \qquad W_{ia} \leftarrow W_{ia} \sum_\mu H_{a\mu} \frac{G_{i\mu}}{(WH)_{i\mu}}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_j W_{ja}}$
subject to $W_{ij} \ge 0$, $H_{ij} \ge 0$, and $\sum_j W_{ja} = 1$ for each $a$.
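These multiplicative updates can be run directly in a few lines of numpy; the sketch below uses a random non-negative matrix and arbitrary sizes (not the leukemia data):

```python
# NMF with multiplicative updates for the divergence objective
# F = sum G log(WH) - (WH), columns of W kept normalized to 1.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 30, 3                     # data size, genes, latent factors
G = rng.random((n, m)) + 0.1            # non-negative "expression" matrix

W = rng.random((n, k)) + 0.1
H = rng.random((k, m)) + 0.1
W /= W.sum(axis=0)                      # columns of W sum to 1

def objective(G, W, H):
    WH = W @ H
    return float(np.sum(G * np.log(WH) - WH))

scores = [objective(G, W, H)]
for _ in range(200):
    WH = W @ H
    H *= W.T @ (G / WH)                 # H_au <- H_au * sum_i W_ia G_iu/(WH)_iu
    WH = W @ H
    W *= (G / WH) @ H.T                 # W_ia <- W_ia * sum_u H_au G_iu/(WH)_iu
    W /= W.sum(axis=0)                  # renormalize columns of W
    scores.append(objective(G, W, H))

print(scores[0], scores[-1])            # the objective F increases
```
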
Learning Curve
<Plot: the objective function F vs. the number of iterations>
Experimental Results:Cancer Clustering
<Plot: encoding values of the 38 training samples under the two factors>
Accuracy: 0-1 errors on the training data set
Diagnosis Using NMF
- For each test sample $\mathbf{g}$, estimate the encoding vector $\mathbf{h}$ that best approximates the sample: $\mathbf{g} \approx W\mathbf{h}$.
  - $W$ is the basis matrix (7,129 genes × 2 factors) computed during training (fixed).
  - As in training, assign each sample to the factor (class) with the highest encoding value.
- Accuracy: 1-2 errors on the test data set
<Figure: g (7,129 genes) ≈ W (7,129 genes × 2 factors) · h(?)>
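A sketch of this diagnosis step, reusing the multiplicative H-update with the basis W held fixed (the sizes and the synthetic test sample are made up):

```python
# Diagnosis with a trained NMF basis: estimate the encoding h for one new
# sample g by iterating only the H-update, then pick the strongest factor.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
W = rng.random((n, k)) + 0.1
W /= W.sum(axis=0)                      # "trained" basis (fixed), columns sum to 1

g = W @ np.array([0.2, 2.0])            # a test sample generated mostly by factor 2

h = np.ones(k)                          # initial encoding
for _ in range(200):
    h *= W.T @ (g / (W @ h))            # multiplicative update, W held fixed

predicted_class = int(np.argmax(h))     # assign to the factor with highest encoding
print(predicted_class)
```

Because the divergence objective is convex in h for fixed W, the iteration recovers the dominant factor reliably here.
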
Clustering Using Generative Topographic Mapping
- GTM: a nonlinear, parametric mapping y(x; W) from a latent space to a data space.
<Figure: a grid of points in the latent space (x_1, x_2) is mapped by y(x; W) onto a curved manifold in the data space (t_1, t_2, t_3)>
Learning Generative Topographic Mapping
- Learning algorithm
  1. Generate the grid of latent points.
  2. Generate the grid of basis function centers.
  3. Compute the matrix of basis function activations Φ.
  4. Initialize the weights W in Y = ΦW and the noise variance β.
  5. Compute Δ_nk = ||t_n − Φ_k W||² = ||t_n − y_k(x, W)||² for each n, k.
  6. Repeat until convergence:
     - Compute the responsibility matrix R using Δ and β. [E-step]
     - Compute G from R (G_kk = Σ_n R_nk).
     - Update W by solving ΦᵀGΦW = ΦᵀRᵀT. [M-step]
     - Compute Δ = ||T − ΦW||².
     - Update β.
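The EM loop above can be sketched compactly in numpy. This is a hedged toy implementation under stated assumptions (a 1-D latent grid, Gaussian basis functions with a bias term, a small ridge term for numerical stability, random data), not the experiment's actual setup:

```python
# Minimal GTM EM loop: responsibilities (E-step), weighted least squares for W
# (M-step), and a noise-precision update. Data and sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(100, 3))                 # N data points in D=3 dimensions
K, J = 20, 6                                  # latent grid points, basis functions
X = np.linspace(-1, 1, K)[:, None]            # 1-D latent grid
M = np.linspace(-1, 1, J)[:, None]            # basis function centers
sigma = 2.0 / (J - 1)
Phi = np.exp(-(X - M.T) ** 2 / (2 * sigma**2))  # K x J basis activations
Phi = np.hstack([Phi, np.ones((K, 1))])         # bias column -> K x (J+1)

W = rng.normal(scale=0.1, size=(J + 1, 3))    # weights: Y = Phi W
beta = 1.0                                    # inverse noise variance
N, D = T.shape

for _ in range(30):
    Y = Phi @ W                                              # mapped grid, K x D
    Delta = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # N x K distances
    # E-step: responsibilities R_nk proportional to exp(-beta/2 * Delta_nk)
    A = -0.5 * beta * Delta
    R = np.exp(A - A.max(axis=1, keepdims=True))
    R /= R.sum(axis=1, keepdims=True)
    # M-step: (Phi^T G Phi) W = Phi^T R^T T, with G = diag(sum_n R_nk)
    Gdiag = np.diag(R.sum(axis=0))
    W = np.linalg.solve(Phi.T @ Gdiag @ Phi + 1e-6 * np.eye(J + 1),
                        Phi.T @ R.T @ T)
    Y = Phi @ W
    Delta = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    beta = N * D / float((R * Delta).sum())   # update noise precision

print(beta)
```
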
GTM: Visualization
- Posterior distribution in latent space given a data point t:
$p(\mathbf{x}_k \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{x}_k)\,p(\mathbf{x}_k)}{\sum_{k'} p(\mathbf{t} \mid \mathbf{x}_{k'})\,p(\mathbf{x}_{k'})}$
- For a whole set of data: for each t, plot in the latent space
  - Posterior mode: $\arg\max_{\mathbf{x}_k} p(\mathbf{x}_k \mid \mathbf{t})$
  - Posterior mean: $\langle \mathbf{x} \mid \mathbf{t} \rangle = \sum_k \mathbf{x}_k\, p(\mathbf{x}_k \mid \mathbf{t})$
GTM: Clustering Experiment
- Gene selection
  - Select about 50 genes out of 7,129 based on three test scores for cancer diagnosis:
    - Correlation metric (similar to a t-test)
    - Wilcoxon test scores (a nonparametric t-test)
    - Median test scores (a nonparametric t-test)
- Clustering & visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.
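As an illustration of the nonparametric selection scores, the sketch below ranks genes by a hand-rolled standardized Wilcoxon rank-sum statistic on synthetic data (the study's actual data and exact scoring details are not reproduced here):

```python
# Gene selection sketch: score each gene by |standardized rank-sum statistic|
# between two sample groups and keep the top k genes. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_all, n_aml, n_genes = 27, 11, 200
expr = rng.normal(size=(n_all + n_aml, n_genes))
expr[:n_all, :5] += 3.0                 # make the first 5 genes discriminative
labels = np.array([0] * n_all + [1] * n_aml)

def rank_sum_z(x, y):
    """Standardized Wilcoxon rank-sum statistic (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1.0
    w = ranks[:n1].sum()                # rank sum of the first group
    mu = n1 * (n1 + n2 + 1) / 2.0
    sd = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mu) / sd

scores = np.array([abs(rank_sum_z(expr[labels == 0, j], expr[labels == 1, j]))
                   for j in range(n_genes)])
top = np.argsort(scores)[::-1][:50]     # indices of the 50 selected genes
print(sorted(top[:5]))
```
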
List of Genes Selected
GTM: Learning Curve
GTM: Clustering Result
- Genes with high expression levels in ALL (large positive P-metric value)
- Genes with high expression levels in AML (large negative P-metric value)
Experimental Results: Clusters Found by GTM
- Three cell cycle-regulated clusters found by GTM [Shin et al. 2000]

| Cluster | Cluster center | No. of train data / no. in cluster | Correct no. / test data | Overall mean expression levels (Cln/b) of known genes |
|---|---|---|---|---|
| G1 c1 | (-0.111, 0.333) | 35 / 18 | 10 / 16 (62%) | (.894, .907, -.766, -.479) |
| G1 c2 | (-0.111, 0.111) | / 7 | 0 / 16 | |
| G2/M c1 | (0.111, 0.333) | 10 / 5 | 0 / 5 | (-.616, -1.01, 1.832, 1.596) |
| G2/M c2 | (0.111, 0.111) | / 3 | 3 / 5 (80%) | |
| M/G1 c1 | (0.111, 0.333) | 13 / 7 | 1 / 6 | (-.171, -.573, .091, .311) |
| M/G1 c2 | (-0.111, -0.111) | / 2 | 0 / 6 | |
| M/G1 c3 | (0.323, 0.1) | / 2 | 0 / 6 | |
| S | (0.111, -0.333) | 5 / 5 | 5 / 5 (100%) | (1.075, 1.482, -.233, -.375) |
| S/G2 | | 5 / | 1 / 2 | (.148, .184, -.367, -.044) |
Experimental Results: Comparison with Other Methods
- Comparison of prototype expression levels

| Cluster | No. of selected genes (GTM) | Mean expression levels (GTM) | No. of selected genes (Spellman) | Mean expression levels (Spellman) |
|---|---|---|---|---|
| G1 c1 | 122 | (.92, .74, -.62, -.33) | 300 | (.66, .49, -.55, -.33) |
| G1 c2 | 74 | (.79, .82, -.48, -.34) | | |
| G2/M c1 | 33 | (-.59, -.96, 1.34, 1.29) | 195 | (-.32, -.62, .49, .54) |
| G2/M c2 | 60 | (.08, -.30, .51, .57) | | |
| M/G1 c1 | 120 | (.82, .65, -.65, -.38) | 113 | (-.21, -.61, -.04, .07) |
| M/G1 c2 | 34 | (-.04, -.37, -.01, -.11) | | |
| M/G1 c3 | 10 | (.32, .29, -.3, .05) | | |
| S | 25 | (.84, .81, -.42, -.33) | 71 | (.46, .47, -.43, -.18) |
| S/G2 | 92 | (.13, -.06, -.1, .01) | 121 | (.13, .05, -.16, .03) |
| Total | 570 | | 800 | |
Conclusion
Summary and Future Work
- A class of nonstandard learning algorithms, i.e., probabilistic graphical models, is presented for microarray data analysis:
  - Bayesian networks
  - Non-negative matrix factorization
  - Generative topographic mapping
- Probabilistic graphical models are useful for biological data mining:
  - Comprehensibility
  - Generative modeling
  - Soft inference (clustering)
  - Dependency analysis
- Future work includes:
  - Construction of larger-scale networks
  - Acceleration of learning processes
  - Handling data sparseness and missing data problems
References
- [Chang et al. 2001] Chang, J.-H., Hwang, K.-B., and Zhang, B.-T., Analysis of gene expression profiles and drug activity patterns for the molecular pharmacology of cancer, In Proceedings of CAMDA'01, 2001 (to appear).
- [Friedman et al. 1999] Friedman, N., Nachman, I., and Pe'er, D., Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm, In Proceedings of UAI'99, 1999.
- [Hwang et al. 2001] Hwang, K.-B., Cho, D.-Y., Park, S.-W., Kim, S.-D., and Zhang, B.-T., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, Methods of Microarray Data Analysis, 167-182, 2001.
- [Scherf et al. 2000] Scherf, U. et al., A gene expression database for the molecular pharmacology of cancer, Nature Genetics, 24:236-244, 2000.
- [Shin et al. 2000] Shin, H.-J., Chang, J.-H., Yang, J.-S., Zhang, B.-T., and Augh, S.-J., Probabilistic models for clustering cell cycle-regulated genes in the yeast, In Proceedings of CAMDA'00, 2000.
Acknowledgements
CAMDA-2000 & CAMDA-2001
- Sirk-June Augh (Molecular Biology)
- Jeong-Ho Chang (Helmholtz Machines, NMF, PLSA)
- Dong-Yeon Cho (Bayesian Evolution, Neural Trees)
- Kyu-Baek Hwang (Bayesian Networks)
- Sung-Dong Kim (Hierarchical Clustering)
- Sang-Wook Park (RBF Networks)
- Hyung-Joo Shin (Probabilistic LSA)
- Jin-San Yang (Statistics, GTM, SOM)
- Ho-Jean Chung (Statistical Learning)
- Seung-Woo Chung (ICA, PCA)
- Jae-Hong Eom (Hidden Markov Models)
- Tae-Jin Jeong (Reinforcement Learning)
- Je-Gun Joung (Evolution, Neural Trees)
- Chul-Joo Kang (Molecular Biology)
- Jun-Shik Kim (Statistical Physics)
- Sun Kim (Evolutionary Algorithms)
- Yoo-Hwan Kim (ICA, Naïve Bayes, Boosting)
- In-Hee Lee (Evolutionary Computation)
- Jae-Won Lee (Reinforcement Learning)
- Jong-Woo Lee (Latent Variable Models)
- Si-Eun Lee (Bayesian Evolution)
- Seung-Joon Lee (Reinforcement Learning)
- Jong-Yoon Lim (SOM, Graphical Models)
- Hyun-Koo Moon (HMM, Naïve Bayes)
- Jang-Min O (Support Vector Machines, Boosting)
- Seong-Bae Park (Decision Trees)
- Soo-Yong Shin (Evolution, Helmholtz Machines)
More information at
http://cbit.snu.ac.kr/http://bi.snu.ac.kr/