Nonstandard Machine Learning Algorithms for
Microarray Data Mining
Byoung-Tak Zhang
Center for Bioinformation Technology (CBIT) &
Biointelligence Laboratory
School of Computer Science and Engineering
Seoul National University
http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/
Outline
- Microarray Bioinformatics
- Probabilistic Graphical Models for Gene Expression Analysis
- Gene-Drug Dependency Analysis with Bayesian Networks
- Gene and Tissue Clustering with Latent Variable Models
- Summary and Future Work
Microarray Bioinformatics
Molecular Biology: Central Dogma
Reductionistic and Synthetic Approaches in Biology
<Diagram: Biological system (organism) ↔ building blocks (genes/molecules). The reductionistic approach (experiments) moves from the organism down to its building blocks; the synthetic approach (bioinformatics) moves from the building blocks back up to the system.>
Topics in Bioinformatics
- Structure analysis: protein structure comparison, protein structure prediction, RNA structure modeling
- Pathway analysis: metabolic pathways, regulatory networks
- Sequence analysis: sequence alignment, structure and function prediction, gene finding
- Expression analysis: gene expression analysis, gene clustering
Applications of DNA Microarrays
- Gene discovery
- Analysis of gene regulation
- Disease diagnosis
- Drug discovery: pharmacogenomics
- Toxicological research: toxicogenomics
A Comparative Hybridization Experiment
<Figure: comparative hybridization workflow ending in image analysis>
Image Analysis
- Scanned images → probe intensities → numerical values for higher-level analysis
<Pipeline: array target segmentation → target detection → background intensity extraction → target intensity extraction → ratio analysis>
Data Preparation for Data Mining
<Figure: microarray image samples (Sample 1, Sample 2, …, Sample n) are converted by image analysis into numerical data (genes × samples) for data mining>
Gene Expression Data Mining
Data preprocessing:
- Normalization
- Discretization
- Gene selection
Learning:
- Greedy search
- EM algorithm
- …
Mining:
- Classification
- Clustering
- Regulation analysis
Why Data Mining?
- Traditional analysis
  - One gene in one experiment
  - Small data sets
  - Simple analytical methods
- High-throughput genomic analysis
  - Simultaneous measurement of thousands of gene expression levels → massive data sets
  - Statistical methods
  - Machine learning approaches
An Example of Data Mining: Clustering
Analysis of DNA Microarray Data: Previous Work
- Characteristics of data
  - Analysis of expression ratios based on each sample
  - Analysis of time-variant data
- Clustering
  - Self-organizing maps [Golub et al., 1999]
  - Singular value decomposition [Alter et al., 2000]
- Classification
  - Support vector machines [Brown et al., 2000]
- Gene identification
  - Information theory [Stefanie et al., 2000]
- Gene modeling
  - Bayesian networks [Friedman et al., 2000]
Probabilistic Graphical Models for Gene Expression Analysis
An Introductory Example
- A Bayesian network classifier for acute leukemias [Hwang et al. 2001]
<Network structure: a Bayesian network over five variables>
- MB-1: MB-1 gene
- C-myb: C-myb gene, extracted from the human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds
- LTC4S: leukotriene C4 synthase (LTC4S) gene
- Zyxin: Zyxin gene
- Leukemia: ALL or AML
$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}(X_i))$
Probabilistic Graphical Models
- The joint probability distribution over $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$
  - Chain rule: $P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$
- Conditional independence $X \perp Y \mid Z$: $P(X \mid Y, Z) = P(X \mid Z)$
- Efficient representation of the joint probability distribution using conditional independencies encoded by the graph (network) structure
  - $X_i \perp \{X_1, \ldots, X_{i-1}\} \setminus V(X_i) \mid V(X_i)$
  - $P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}) = \prod_{i=1}^{n} P(X_i \mid V(X_i))$
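To make the factorization concrete, here is a small illustrative sketch (not from the slides; the three-node network, parent sets, and CPT values are invented) of computing a joint probability from conditional probability tables:

```python
# Illustrative sketch: the joint probability of a tiny Bayesian network as the
# product of conditional probabilities P(X_i | Pa(X_i)). All values are made up.
from itertools import product

# Each CPT maps a tuple of parent values to P(X = 1 | parents).
cpts = {
    "A": {(): 0.3},                  # A has no parents
    "B": {(0,): 0.9, (1,): 0.2},     # parent: A
    "C": {(0,): 0.4, (1,): 0.7},     # parent: B
}
parents = {"A": [], "B": ["A"], "C": ["B"]}
order = ["A", "B", "C"]              # a topological order of the DAG

def joint(assignment):
    """P(X) = prod_i P(X_i | Pa(X_i)) for a full 0/1 assignment dict."""
    p = 1.0
    for var in order:
        pa = tuple(assignment[q] for q in parents[var])
        p1 = cpts[var][pa]
        p *= p1 if assignment[var] == 1 else 1.0 - p1
    return p

# Sanity check: the joint sums to 1 over all 2^3 assignments.
total = sum(joint(dict(zip(order, vals))) for vals in product([0, 1], repeat=3))
print(round(total, 10))
```
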
Analysis Procedure
Gene expression data → discretization and gene selection → learning graphical models
<Figure: selected genes (Gene A, Gene B, Gene C, Gene D) and the target variable → learned graphical model>
Uses of the learned model:
- Classification
- Dependency analysis
- Clustering
Why Probabilistic Graphical Models?
- The joint probability distribution → all the knowledge about the (biological) system
- Generative → modeling of data sources
- Efficient probabilistic inference → prediction
- The network structure → insight into the intricate relationships between components (i.e., genes) of the biological system → dependency analysis
- Robust to noise and error in gene expression data
Classes of Graphical Models
Graphical Models
- Undirected
  - Boltzmann machines
  - Markov random fields
- Directed
  - Bayesian networks
  - Latent variable models
  - Hidden Markov models
  - Generative topographic mapping
  - Non-negative matrix factorization
Gene-Drug Dependency Analysis with Bayesian Networks
NCI Drug Discovery Program
NCI 60 cell lines data set
NCI 60 Cell Lines Data Set
- 60 human cancer cell lines
  - Cancers of colorectal, renal, ovarian, breast, prostate, lung, and central nervous system origin, as well as leukemias and melanomas.
- Drug activity patterns (Database A)
  - Sulphorhodamine B assay → changes in total cellular protein after 48 hours of drug treatment.
- Individual targets (matrix Ti)
  - Analysis of molecular characteristics other than mRNA expression.
- This study focuses on
  - Sensitivity to therapy, not molecular consequences of therapy.
Gene-Drug Dependency Analysis Using Bayesian Networks
- To discover
  - Gene-gene expression dependency
  - Gene expression-drug activity dependency
  - Drug-drug activity dependency
<Figure: example network over Gene A, Gene B, Drug A, Drug B, and Drug C, with a local probability distribution attached to each node>
Bayesian Networks: the Qualitative Part
- The directed acyclic graph (DAG) structure
  - Conditional independencies between variables → dependency analysis
<The DAG structure: a network over Gene A, Gene B, Gene C, Gene D, Gene E>
<Conditional independencies>
- Gene B and Gene D are independent given Gene A.
- Gene B asserts dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.
Bayesian Networks: the Quantitative Part
- The joint probability distribution over all the variables in a Bayesian network:
$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}(X_i))$
For the DAG over Gene A, …, Gene E shown before:
$P(A, B, C, D, E) = P(A)\,P(E \mid A)\,P(B \mid A, E)\,P(D \mid A, E, B)\,P(C \mid A, E, B, D)$
$= P(A)\,P(E)\,P(B \mid A, E)\,P(D \mid A)\,P(C \mid B)$
- Local probability distribution for $X_i$:
  - $\Theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijr_i})$: parameters for $P(X_i \mid \mathbf{Pa}(X_i) = j)$
  - $p(\Theta_{ij}) = \mathrm{Dir}(\theta_{ij1}, \ldots, \theta_{ijr_i} \mid \alpha_{ij1}, \ldots, \alpha_{ijr_i})$
  - $q_i$: number of configurations of $\mathbf{Pa}(X_i)$; $r_i$: number of states of $X_i$
Applications of Bayesian Networks
- Classification
  - Bayesian network classifiers
  - Probabilistic inference → prediction → cancer classification based on gene expression
- Dependency analysis
  - Bayesian network structure → dependency between components → putative causal relationships
Discretization of Gene Expression Levels
- The local probability distribution for each variable
  - Multinomial distribution with a Dirichlet prior:
$p(\theta_{ij} \mid S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijr_i}) = \frac{\Gamma(\alpha_{ij1} + \cdots + \alpha_{ijr_i})}{\Gamma(\alpha_{ij1}) \cdots \Gamma(\alpha_{ijr_i})} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$
where $\Gamma(x + 1) = x\,\Gamma(x)$, $\Gamma(1) = 1$, and $S^h$ is the network structure.
- Discretization of gene expression levels and drug activities
  - All values → 0 and 1 according to the mean values across the training samples.
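The binarization step can be sketched as follows (an illustrative example with made-up expression values; the slide specifies only thresholding at the per-gene mean across the training samples):

```python
# Sketch of mean-based discretization: expression values above a gene's mean
# across the training samples become 1, the rest become 0. Data are invented.
import numpy as np

expr = np.array([[2.1, 0.5, 3.3],
                 [1.9, 1.5, 0.2],
                 [0.0, 1.0, 2.5],
                 [4.0, 1.0, 2.0]])      # 4 training samples x 3 genes

means = expr.mean(axis=0)               # per-gene mean over training samples
binary = (expr > means).astype(int)     # 1 above the mean, 0 otherwise
print(binary)
```
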
Selection of Genes and Drugs for Analysis
<Flow: known drugs → elimination of missing values → 566 human genes and ESTs, 82 drugs → learning a Bayesian network with 648 nodes>
Learn the Bayesian Network from Data
- Two approaches to Bayesian network learning
  - Dependency analysis-based approach
  - Optimization-based approach
    - Network score: fitness of the network to the training data D.
    - Search for the best-scoring network structure.
- Scoring metrics for the network
  - MDL (minimum description length) score
  - BD (Bayesian Dirichlet) score
BD (Bayesian Dirichlet) Score
- Assumptions
  - Multinomial sample
  - Parameter independence and modularity
  - Dirichlet prior: $p(\theta_{ij} \mid S^h) = c \cdot \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$
  - Complete data
- The BD score is calculated as follows:
$p(D, S) = p(S) \cdot p(D \mid S) = p(S) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$
  - $S$: the network structure; $D$: the training data
  - $p(S)$: prior probability over structures
  - $N_{ijk}$: sufficient statistics calculated from $D$, with $N_{ij} = \sum_k N_{ijk}$ and $\alpha_{ij} = \sum_k \alpha_{ijk}$
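A minimal sketch of evaluating the (log) BD term $p(D \mid S)$ for a candidate structure, assuming binary variables and uniform hyperparameters $\alpha_{ijk} = 1$; the tiny data set and structure are invented for illustration:

```python
# Log BD marginal likelihood log p(D | S) for discrete data, computed term by
# term from the Gamma-function formula above. Toy data/structure, alpha_ijk = 1.
import math
from itertools import product

data = [  # rows: samples; columns: X0, X1 (binary)
    (0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0),
]
parents = {0: [], 1: [0]}   # candidate structure: X0 -> X1
r = 2                       # every variable has 2 states

def log_bd(data, parents, alpha=1.0):
    score = 0.0
    for i, pa in parents.items():
        # one Dirichlet-multinomial term per parent configuration j
        for j in product(range(r), repeat=len(pa)):
            counts = [0] * r                       # N_ijk
            for row in data:
                if tuple(row[p] for p in pa) == j:
                    counts[row[i]] += 1
            a_ij, n_ij = alpha * r, sum(counts)
            score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
            for n_ijk in counts:
                score += math.lgamma(alpha + n_ijk) - math.lgamma(alpha)
    return score  # log p(D | S); add log p(S) for the full log BD score

print(log_bd(data, parents))
```
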
Search Strategy
- Finding the best network structure → an NP-hard problem.
  - Exponential time complexity.
  - In general, greedy search or simulated annealing is used.
- A general greedy search algorithm is not applicable to the construction of large-scale Bayesian networks with hundreds of nodes.
  - Time and space complexity.
- Reduce the search space
  - “Sparse candidate algorithm” [Friedman et al. 1999]
  - “Local to global search algorithm”
    - Exploit a local search for the Markov blanket of each node to reduce the global search space.
Local to Global Search Algorithm
<Algorithm: a local search first identifies a candidate parent set (Markov blanket) for each node; a greedy global search restricted to these candidates then constructs the network.>
The network score decomposes over the nodes:
$\mathrm{Score}(B, D) = \sum_i \mathrm{Score}(X_i \mid \mathbf{Pa}_B(X_i), D)$
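The decomposable score makes greedy structure search cheap: each candidate edge changes only one local term. Below is a hedged sketch (plain greedy edge addition, not the sparse-candidate or local-to-global algorithm itself) using a BIC-style local score as a stand-in for the BD score, on invented binary data:

```python
# Greedy, score-based structure search exploiting decomposability:
# Score(B, D) = sum_i Score(X_i | Pa_B(X_i), D). BIC-style stand-in score.
import math
from itertools import product

def local_score(i, pa, data, r=2):
    """Log-likelihood of X_i given parents pa, minus a BIC penalty."""
    score = 0.0
    for j in product(range(r), repeat=len(pa)):
        counts = [0] * r
        for row in data:
            if tuple(row[p] for p in pa) == j:
                counts[row[i]] += 1
        n_ij = sum(counts)
        for c in counts:
            if c:
                score += c * math.log(c / n_ij)
    penalty = 0.5 * math.log(len(data)) * (r - 1) * r ** len(pa)
    return score - penalty

def creates_cycle(parents, child, new_parent):
    """Would the edge new_parent -> child close a directed cycle?"""
    stack = [new_parent]
    while stack:
        v = stack.pop()
        if v == child:
            return True
        stack.extend(parents[v])
    return False

def greedy_search(data, n_vars):
    parents = {i: [] for i in range(n_vars)}
    while True:
        best, best_gain = None, 1e-9
        for c in range(n_vars):
            base = local_score(c, parents[c], data)
            for p in range(n_vars):
                if p == c or p in parents[c] or creates_cycle(parents, c, p):
                    continue
                gain = local_score(c, parents[c] + [p], data) - base
                if gain > best_gain:
                    best, best_gain = (p, c), gain
        if best is None:
            return parents
        parents[best[1]].append(best[0])   # add the best-scoring edge

# X1 roughly copies X0, so the search should recover one edge between them.
data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 0), (1, 1), (1, 0), (0, 0)]
print(greedy_search(data, 2))
```
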
Learning Curve
<Plot: BD score vs. number of greedy search steps>
Learning takes about 8 minutes.
Drug-Drug Dependency
- The pyrimidine analogues (DNA synthesis inhibitors)
  - Aphidicolin-glycinate
  - Floxuridine
  - Cyclocytidine
  - Cytarabine
- These drugs show similar activity patterns across the 60 cancer cell lines.
- They were clustered together in [Scherf et al. 2000].
Experimental Results: Drug-Drug Dependency
- Drug-drug activity correlations
<Part of the learned Bayesian network structure>
- The three drugs “Aphidicolin-glycinate”, “Floxuridine”, and “Cytarabine” directly depend on each other.
- “Cyclocytidine” directly depends on “Cytarabine” and vice versa.
→ Confirmation of known relationships
Gene-Drug Dependency
- Gene “ASNS”
  - Certain malignant cells, including many acute lymphoblastic leukemias (ALL), lack asparagine synthetase (ASNS) → dependence on exogenous “L-asparagine” [Scherf et al. 2000].
- High negative correlation between the expression levels of the gene “ASNS” and the sensitivity to the drug “L-asparagine”.
Experimental Results:Gene-Drug Dependency
- Gene expression-drug activity correlations
<Part of the learned Bayesian network structure>
- The known negative correlation between “ASNS” and “L-asparagine” appears in the network → confirmation; it is mediated by two other genes → discovery.
- The relationships revealed by the Bayesian network are putative and should be verified by biological experiments → exploratory analysis.
Gene and Tissue Clustering with Latent Variable Models
CAMDA-2000 Data Set
- Gene expression data for cancer prediction
  - Training data: 38 leukemia samples (27 ALL, 11 AML)
  - Test data: 34 leukemia samples (20 ALL, 14 AML)
  - The datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.
Cluster Analysis Using Non-negative Matrix Factorization
- Method
  - Use NMF for class clustering and prediction on gene expression data from acute leukemia patients.
- NMF (non-negative matrix factorization)
$G \approx WH, \qquad G_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}, \qquad G_{i\mu}, W_{ia}, H_{a\mu} \ge 0$
  - G: gene expression data matrix
  - W: basis matrix (prototypes)
  - H: encoding matrix (in low dimension)
- NMF as a latent variable model
<Figure: latent variables h_1, h_2, …, h_r generate the observed g_1, g_2, …, g_n through the weights W; g = ⟨h⟩W>
Clustering Gene Expression Data
<Figure: the data matrix G (7,129 genes × 38 samples) is factorized as G ≈ WH, with W(?) (7,129 genes × 2 factors) and the encoding H(?) (2 factors × 38 samples)>
- Factors can capture the correlations between the genes using the values of expression level.
- Cluster the training samples into 2 groups by NMF
  - Assign each sample to the factor (class) with the higher encoding value (row H_1· vs. H_2·).
Learning Procedure
Input: gene expression data matrix G (n × m)
Output: basis matrix W (n × k), encoding matrix H (k × m)
n: data size, m: number of genes, k: number of latent variables
Objective function:
$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$
Procedure
1. Initialize W, H with random numbers.
2. Update W, H iteratively until max_iteration or some criterion is met:
$H_{a\mu} \leftarrow H_{a\mu} \sum_i W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}, \qquad W_{ia} \leftarrow W_{ia} \sum_\mu H_{a\mu} \frac{G_{i\mu}}{(WH)_{i\mu}}, \qquad W_{ia} \leftarrow \frac{W_{ia}}{\sum_j W_{ja}}$
subject to $W_{ij} \ge 0$, $H_{ij} \ge 0$, and $\sum_j W_{ja} = 1$ for each $a$.
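These multiplicative updates can be run directly in a few lines of numpy; the sketch below uses a random non-negative matrix and arbitrary sizes (not the leukemia data):

```python
# NMF with multiplicative updates for the divergence objective
# F = sum G log(WH) - (WH), columns of W kept normalized to 1.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 30, 3                     # data size, genes, latent factors
G = rng.random((n, m)) + 0.1            # non-negative "expression" matrix

W = rng.random((n, k)) + 0.1
H = rng.random((k, m)) + 0.1
W /= W.sum(axis=0)                      # columns of W sum to 1

def objective(G, W, H):
    WH = W @ H
    return float(np.sum(G * np.log(WH) - WH))

scores = [objective(G, W, H)]
for _ in range(200):
    WH = W @ H
    H *= W.T @ (G / WH)                 # H_au <- H_au * sum_i W_ia G_iu/(WH)_iu
    WH = W @ H
    W *= (G / WH) @ H.T                 # W_ia <- W_ia * sum_u H_au G_iu/(WH)_iu
    W /= W.sum(axis=0)                  # renormalize columns of W
    scores.append(objective(G, W, H))

print(scores[0], scores[-1])            # the objective F increases
```
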
Learning Curve
<Plot: the objective function F vs. the number of iterations>
Experimental Results:Cancer Clustering
<Plot: encoding values of the 38 training samples under the two factors>
Accuracy: 0-1 errors on the training data set
Diagnosis Using NMF
- For each test sample $\mathbf{g}$, estimate the encoding vector $\mathbf{h}$ that best approximates the sample: $\mathbf{g} \approx W\mathbf{h}$.
  - $W$ is the basis matrix (7,129 genes × 2 factors) computed during training (fixed).
  - As in training, assign each sample to the factor (class) with the highest encoding value.
- Accuracy: 1-2 errors on the test data set
<Figure: g (7,129 genes) ≈ W (7,129 genes × 2 factors) · h(?)>
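A sketch of this diagnosis step, reusing the multiplicative H-update with the basis W held fixed (the sizes and the synthetic test sample are made up):

```python
# Diagnosis with a trained NMF basis: estimate the encoding h for one new
# sample g by iterating only the H-update, then pick the strongest factor.
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 2
W = rng.random((n, k)) + 0.1
W /= W.sum(axis=0)                      # "trained" basis (fixed), columns sum to 1

g = W @ np.array([0.2, 2.0])            # a test sample generated mostly by factor 2

h = np.ones(k)                          # initial encoding
for _ in range(200):
    h *= W.T @ (g / (W @ h))            # multiplicative update, W held fixed

predicted_class = int(np.argmax(h))     # assign to the factor with highest encoding
print(predicted_class)
```

Because the divergence objective is convex in h for fixed W, the iteration recovers the dominant factor reliably here.
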
Clustering Using Generative Topographic Mapping
- GTM: a nonlinear, parametric mapping y(x; W) from a latent space to a data space.
<Figure: a grid of points in the latent space (x_1, x_2) is mapped by y(x; W) onto a curved manifold in the data space (t_1, t_2, t_3)>
Learning Generative Topographic Mapping
- Learning algorithm
  1. Generate the grid of latent points.
  2. Generate the grid of basis function centers.
  3. Compute the matrix of basis function activations Φ.
  4. Initialize the weights W in Y = ΦW and the noise variance β.
  5. Compute Δ_nk = ||t_n − Φ_k W||² = ||t_n − y_k(x, W)||² for each n, k.
  6. Repeat until convergence:
     - Compute the responsibility matrix R using Δ and β. [E-step]
     - Compute G from R (G_kk = Σ_n R_nk).
     - Update W by solving ΦᵀGΦW = ΦᵀRᵀT. [M-step]
     - Compute Δ = ||T − ΦW||².
     - Update β.
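The EM loop above can be sketched compactly in numpy. This is a hedged toy implementation under stated assumptions (a 1-D latent grid, Gaussian basis functions with a bias term, a small ridge term for numerical stability, random data), not the experiment's actual setup:

```python
# Minimal GTM EM loop: responsibilities (E-step), weighted least squares for W
# (M-step), and a noise-precision update. Data and sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(100, 3))                 # N data points in D=3 dimensions
K, J = 20, 6                                  # latent grid points, basis functions
X = np.linspace(-1, 1, K)[:, None]            # 1-D latent grid
M = np.linspace(-1, 1, J)[:, None]            # basis function centers
sigma = 2.0 / (J - 1)
Phi = np.exp(-(X - M.T) ** 2 / (2 * sigma**2))  # K x J basis activations
Phi = np.hstack([Phi, np.ones((K, 1))])         # bias column -> K x (J+1)

W = rng.normal(scale=0.1, size=(J + 1, 3))    # weights: Y = Phi W
beta = 1.0                                    # inverse noise variance
N, D = T.shape

for _ in range(30):
    Y = Phi @ W                                              # mapped grid, K x D
    Delta = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # N x K distances
    # E-step: responsibilities R_nk proportional to exp(-beta/2 * Delta_nk)
    A = -0.5 * beta * Delta
    R = np.exp(A - A.max(axis=1, keepdims=True))
    R /= R.sum(axis=1, keepdims=True)
    # M-step: (Phi^T G Phi) W = Phi^T R^T T, with G = diag(sum_n R_nk)
    Gdiag = np.diag(R.sum(axis=0))
    W = np.linalg.solve(Phi.T @ Gdiag @ Phi + 1e-6 * np.eye(J + 1),
                        Phi.T @ R.T @ T)
    Y = Phi @ W
    Delta = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    beta = N * D / float((R * Delta).sum())   # update noise precision

print(beta)
```
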
GTM: Visualization
- Posterior distribution in latent space given a data point t:
$p(\mathbf{x}_k \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{x}_k)\,p(\mathbf{x}_k)}{\sum_{k'} p(\mathbf{t} \mid \mathbf{x}_{k'})\,p(\mathbf{x}_{k'})}$
- For a whole set of data: for each t, plot in the latent space
  - Posterior mode: $\arg\max_{\mathbf{x}_k} p(\mathbf{x}_k \mid \mathbf{t})$
  - Posterior mean: $\langle \mathbf{x} \mid \mathbf{t} \rangle = \sum_k \mathbf{x}_k\, p(\mathbf{x}_k \mid \mathbf{t})$
GTM: Clustering Experiment
- Gene selection
  - Select about 50 genes out of 7,129 based on three test scores for cancer diagnosis:
    - Correlation metric (similar to a t-test)
    - Wilcoxon test scores (a nonparametric t-test)
    - Median test scores (a nonparametric t-test)
- Clustering & visualization
  - After learning a model, genes are plotted in the latent space.
  - With the mapping in the latent space, clusters can be identified.
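As an illustration of the nonparametric selection scores, the sketch below ranks genes by a hand-rolled standardized Wilcoxon rank-sum statistic on synthetic data (the study's actual data and exact scoring details are not reproduced here):

```python
# Gene selection sketch: score each gene by |standardized rank-sum statistic|
# between two sample groups and keep the top k genes. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_all, n_aml, n_genes = 27, 11, 200
expr = rng.normal(size=(n_all + n_aml, n_genes))
expr[:n_all, :5] += 3.0                 # make the first 5 genes discriminative
labels = np.array([0] * n_all + [1] * n_aml)

def rank_sum_z(x, y):
    """Standardized Wilcoxon rank-sum statistic (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = np.argsort(np.argsort(np.concatenate([x, y]))) + 1.0
    w = ranks[:n1].sum()                # rank sum of the first group
    mu = n1 * (n1 + n2 + 1) / 2.0
    sd = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (w - mu) / sd

scores = np.array([abs(rank_sum_z(expr[labels == 0, j], expr[labels == 1, j]))
                   for j in range(n_genes)])
top = np.argsort(scores)[::-1][:50]     # indices of the 50 selected genes
print(sorted(top[:5]))
```
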
List of Genes Selected
GTM: Learning Curve
GTM: Clustering Result
- Genes with high expression levels in ALL (large positive P-metric value)
- Genes with high expression levels in AML (large negative P-metric value)
Experimental Results: Clusters Found by GTM
- Three cell cycle-regulated clusters found by GTM [Shin et al. 2000]

| Cluster | Cluster center | No. of train data / no. in cluster | Correct no. / test data | Overall mean expression levels (Cln/b) of known genes |
|---|---|---|---|---|
| G1 c1 | (-0.111, 0.333) | 35 / 18 | 10 / 16 (62%) | (.894, .907, -.766, -.479) |
| G1 c2 | (-0.111, 0.111) | / 7 | 0 / 16 | |
| G2/M c1 | (0.111, 0.333) | 10 / 5 | 0 / 5 | (-.616, -1.01, 1.832, 1.596) |
| G2/M c2 | (0.111, 0.111) | / 3 | 3 / 5 (80%) | |
| M/G1 c1 | (0.111, 0.333) | 13 / 7 | 1 / 6 | (-.171, -.573, .091, .311) |
| M/G1 c2 | (-0.111, -0.111) | / 2 | 0 / 6 | |
| M/G1 c3 | (0.323, 0.1) | / 2 | 0 / 6 | |
| S | (0.111, -0.333) | 5 / 5 | 5 / 5 (100%) | (1.075, 1.482, -.233, -.375) |
| S/G2 | | 5 / | 1 / 2 | (.148, .184, -.367, -.044) |
Experimental Results: Comparison with Other Methods
- Comparison of prototype expression levels

| Cluster | No. of selected genes (GTM) | Mean expression levels (GTM) | No. of selected genes (Spellman) | Mean expression levels (Spellman) |
|---|---|---|---|---|
| G1 c1 | 122 | (.92, .74, -.62, -.33) | 300 | (.66, .49, -.55, -.33) |
| G1 c2 | 74 | (.79, .82, -.48, -.34) | | |
| G2/M c1 | 33 | (-.59, -.96, 1.34, 1.29) | 195 | (-.32, -.62, .49, .54) |
| G2/M c2 | 60 | (.08, -.30, .51, .57) | | |
| M/G1 c1 | 120 | (.82, .65, -.65, -.38) | 113 | (-.21, -.61, -.04, .07) |
| M/G1 c2 | 34 | (-.04, -.37, -.01, -.11) | | |
| M/G1 c3 | 10 | (.32, .29, -.3, .05) | | |
| S | 25 | (.84, .81, -.42, -.33) | 71 | (.46, .47, -.43, -.18) |
| S/G2 | 92 | (.13, -.06, -.1, .01) | 121 | (.13, .05, -.16, .03) |
| Total | 570 | | 800 | |
Conclusion
Summary and Future Work
- A class of nonstandard learning algorithms, i.e., probabilistic graphical models, is presented for microarray data analysis:
  - Bayesian networks
  - Non-negative matrix factorization
  - Generative topographic mapping
- Probabilistic graphical models are useful for biological data mining:
  - Comprehensibility
  - Generative modeling
  - Soft inference (clustering)
  - Dependency analysis
- Future work includes:
  - Construction of larger-scale networks
  - Acceleration of learning processes
  - Handling data sparseness and missing data problems
References
- [Chang et al. 2001] Chang, J.-H., Hwang, K.-B., and Zhang, B.-T., Analysis of gene expression profiles and drug activity patterns for the molecular pharmacology of cancer, In Proceedings of CAMDA'01, 2001 (to appear).
- [Friedman et al. 1999] Friedman, N., Nachman, I., and Pe'er, D., Learning Bayesian network structure from massive datasets: the "sparse candidate" algorithm, In Proceedings of UAI'99, 1999.
- [Hwang et al. 2001] Hwang, K.-B., Cho, D.-Y., Park, S.-W., Kim, S.-D., and Zhang, B.-T., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, Methods of Microarray Data Analysis, 167-182, 2001.
- [Scherf et al. 2000] Scherf, U. et al., A gene expression database for the molecular pharmacology of cancer, Nature Genetics, 24:236-244, 2000.
- [Shin et al. 2000] Shin, H.-J., Chang, J.-H., Yang, J.-S., Zhang, B.-T., and Augh, S.-J., Probabilistic models for clustering cell cycle-regulated genes in the yeast, In Proceedings of CAMDA'00, 2000.
Acknowledgements
CAMDA-2000 & CAMDA-2001
- Sirk-June Augh (Molecular Biology)
- Jeong-Ho Chang (Helmholtz Machines, NMF, PLSA)
- Dong-Yeon Cho (Bayesian Evolution, Neural Trees)
- Kyu-Baek Hwang (Bayesian Networks)
- Sung-Dong Kim (Hierarchical Clustering)
- Sang-Wook Park (RBF Networks)
- Hyung-Joo Shin (Probabilistic LSA)
- Jin-San Yang (Statistics, GTM, SOM)
- Ho-Jean Chung (Statistical Learning)
- Seung-Woo Chung (ICA, PCA)
- Jae-Hong Eom (Hidden Markov Models)
- Tae-Jin Jeong (Reinforcement Learning)
- Je-Gun Joung (Evolution, Neural Trees)
- Chul-Joo Kang (Molecular Biology)
- Jun-Shik Kim (Statistical Physics)
- Sun Kim (Evolutionary Algorithms)
- Yoo-Hwan Kim (ICA, Naïve Bayes, Boosting)
- In-Hee Lee (Evolutionary Computation)
- Jae-Won Lee (Reinforcement Learning)
- Jong-Woo Lee (Latent Variable Models)
- Si-Eun Lee (Bayesian Evolution)
- Seung-Joon Lee (Reinforcement Learning)
- Jong-Yoon Lim (SOM, Graphical Models)
- Hyun-Koo Moon (HMM, Naïve Bayes)
- Jang-Min O (Support Vector Machines, Boosting)
- Seong-Bae Park (Decision Trees)
- Soo-Yong Shin (Evolution, Helmholtz Machines)
More information at
http://cbit.snu.ac.kr/http://bi.snu.ac.kr/