Date post: 08-Jun-2020

Nonstandard Machine Learning Algorithms for Microarray Data Mining

Byoung-Tak Zhang

Center for Bioinformation Technology (CBIT) &

Biointelligence Laboratory

School of Computer Science and Engineering

Seoul National University

http://cbit.snu.ac.kr/

http://bi.snu.ac.kr/

SNU Center for Bioinformation Technology (CBIT)

Outline

• Microarray Bioinformatics
• Probabilistic Graphical Models for Gene Expression Analysis
• Gene-Drug Dependency Analysis with Bayesian Networks
• Gene and Tissue Clustering with Latent Variable Models
• Summary and Future Work

Microarray Bioinformatics

Molecular Biology: Central Dogma

Reductionistic and Synthetic Approaches in Biology

[Figure: a biological system (organism) and its building blocks (genes/molecules), connected by the reductionistic approach (experiments) and the synthetic approach (bioinformatics).]

Topics in Bioinformatics

• Structure analysis
  ♦ Protein structure comparison
  ♦ Protein structure prediction
  ♦ RNA structure modeling
• Pathway analysis
  ♦ Metabolic pathway
  ♦ Regulatory networks
• Sequence analysis
  ♦ Sequence alignment
  ♦ Structure and function prediction
  ♦ Gene finding
• Expression analysis
  ♦ Gene expression analysis
  ♦ Gene clustering

Applications of DNA Microarrays

• Gene discovery
• Analysis of gene regulation
• Disease diagnosis
• Drug discovery: pharmacogenomics
• Toxicological research: toxicogenomics

A Comparative Hybridization Experiment

[Figure: a comparative hybridization experiment, followed by image analysis.]

Image Analysis

• Scanned images → probe intensities → numerical values for higher-level analysis

[Figure labels: array target segmentation, background intensity extraction, target detection, target intensity extraction, ratio analysis.]

Data Preparation for Data Mining

[Figure: <Microarray image samples> (Sample 1, Sample 2, Sample i, Sample k, Sample n) are converted by image analysis into <Numerical data for data mining>, labelled by sample and gene.]

Gene Expression Data Mining

Data preprocessing:
- Normalization
- Discretization
- Gene selection

Learning:
- Greedy search
- EM algorithm

Mining:
- Classification
- Clustering
- Regulation analysis

Why Data Mining?

• Traditional analysis
  ♦ One gene in one experiment
  ♦ Small data sets
  ♦ Simple analytical methods
• High-throughput genomic analysis
  ♦ Simultaneous measurements of thousands of gene expression levels → massive data sets
  ♦ Statistical methods
  ♦ Machine learning approach


An Example of Data Mining: Clustering

Analysis of DNA Microarray Data: Previous Work

• Characteristics of data
  ♦ Analysis of expression ratio based on each sample
  ♦ Analysis of time-variant data
• Clustering
  ♦ Self-organizing maps [Golub et al., 1999]
  ♦ Singular value decomposition [Alter et al., 2000]
• Classification
  ♦ Support vector machines [Brown et al., 2000]
• Gene identification
  ♦ Information theory [Stefanie et al., 2000]
• Gene modeling
  ♦ Bayesian networks [Friedman et al., 2000]

Probabilistic Graphical Models for Gene Expression Analysis

An Introductory Example

• A Bayesian network classifier for acute leukemias [Hwang et al. 2001]

<Network structure>
[Figure: network over the nodes Zyxin, Leukemia, MB-1, C-myb, and LTC4S.]

Node descriptions:
- Leukemia: ALL or AML
- Zyxin: Zyxin
- MB-1: MB-1 gene
- C-myb: C-myb gene extracted from Human (c-myb) gene, complete primary cds, and five complete alternatively spliced cds
- LTC4S: Leukotriene C4 synthase (LTC4S) gene

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_i)$$

Probabilistic Graphical Models

• The joint probability distribution over X = {X1, X2, …, Xn}
  ♦ Chain rule:

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$$

• Conditional independence X ⊥ Y | Z:

$$P(X \mid Y, Z) = P(X \mid Z)$$

• Efficient representation of the joint probability distribution using conditional independencies encoded by the graph (network) structure
  ♦ Xi ⊥ {X1,…,Xi-1} − V(Xi) | V(Xi)

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}) = \prod_{i=1}^{n} P(X_i \mid V(X_i))$$

Analysis Procedure

Gene expression data → discretization and selection → selected genes and the target variable → learning graphical models → <Learned Graphical Models>

[Figure: the nodes Gene A, Gene B, Gene C, Gene D, and Target, shown once as selected variables and once as a learned graphical model.]

Uses of the learned graphical models:
- Classification
- Dependency analysis
- Clustering

Why Probabilistic Graphical Models?

• The joint probability distribution → all the knowledge about the (biological) system.
• Generative → modeling data sources
• Efficient probabilistic inference → prediction
• The network structure → an insight into the intricate relationships between components (i.e., genes) of the biological system → dependency analysis
• Robust to the noise and error in gene expression data

Classes of Graphical Models

Undirected:
- Boltzmann Machines
- Markov Random Fields

Directed:
- Bayesian Networks
- Latent Variable Models
- Hidden Markov Models
- Generative Topographic Mapping
- Non-negative Matrix Factorization

Gene-Drug Dependency Analysis with Bayesian Networks

NCI Drug Discovery Program

[Figure: the NCI drug discovery program; label: NCI 60 cell lines data set.]

NCI 60 Cell Lines Data Set

• 60 human cancer cell lines
  ♦ Colorectal, renal, ovarian, breast, prostate, lung, and central nervous system origin cancers, as well as leukemias and melanomas.
• Drug activity patterns (Database A)
  ♦ Sulphorhodamine B assay → changes in total cellular protein after 48 hours of drug treatment.
• Individual targets (matrix Ti)
  ♦ Analysis of molecular characteristics other than mRNA expressions.
• This study focuses on
  ♦ Sensitivity to therapy, not molecular consequences of therapy

Gene-Drug Dependency Analysis Using Bayesian Networks

• To discover
  ♦ Gene-gene expression dependency
  ♦ Gene expression-drug activity dependency
  ♦ Drug-drug activity dependency

[Figure: example network over Drug A, Drug B, Drug C, Gene A, and Gene B, with a local probability distribution attached to a node.]

Bayesian Networks: the Qualitative Part

• The directed-acyclic graph (DAG) structure
  ♦ Conditional independencies between variables → dependency analysis

<The DAG structure>
[Figure: DAG over Gene A, Gene B, Gene C, Gene D, and Gene E.]

<Conditional independencies>
- Gene B and Gene D are independent given Gene A.
- Gene B asserts dependency between Gene A and Gene E.
- Gene A and Gene C are independent given Gene B.

Bayesian Networks: the Quantitative Part

• The joint probability distribution over all the variables in a Bayesian network.

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathbf{Pa}_i)$$

[Figure: DAG over Gene A, Gene B, Gene C, Gene D, and Gene E.]

$$P(A,B,C,D,E) = P(A)\,P(E \mid A)\,P(B \mid A,E)\,P(D \mid A,E,B)\,P(C \mid A,E,B,D) = P(A)\,P(E)\,P(B \mid A,E)\,P(D \mid A)\,P(C \mid B)$$

Local probability distribution for Xi:

$$(\theta_{i1}, \ldots, \theta_{iq_i}) \sim \text{parameter for } P(X_i \mid \mathbf{Pa}_i)$$

$$P(\theta_{ij}) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ijr_i})$$

q_i: # of configurations for Pa_i
r_i: # of states for X_i
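The factorization for the five-gene example can be checked numerically. The following is a minimal sketch, assuming binary variables and made-up conditional probability tables (none of the numbers come from the slides; `P_A`, `P_B`, and the helper names are illustrative only). It builds the joint distribution from the local distributions and verifies one of the conditional independencies read off the DAG.

```python
import itertools

# Hypothetical CPTs for the DAG A -> B <- E, A -> D, B -> C (binary variables).
# Each entry is the probability that the variable equals 1.
P_A = 0.3
P_E = 0.6
P_B = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # keyed by (a, e)
P_D = {0: 0.2, 1: 0.7}                                       # keyed by a
P_C = {0: 0.25, 1: 0.8}                                      # keyed by b


def bern(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(E) P(B|A,E) P(D|A) P(C|B)."""
    return (bern(P_A, a) * bern(P_E, e) * bern(P_B[(a, e)], b)
            * bern(P_D[a], d) * bern(P_C[b], c))


def marginal(**fixed):
    """Sum the joint over all variables that are not fixed by keyword."""
    total = 0.0
    for a, b, c, d, e in itertools.product((0, 1), repeat=5):
        assignment = dict(a=a, b=b, c=c, d=d, e=e)
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(a, b, c, d, e)
    return total


def b_and_d_independent_given_a(tol=1e-12):
    """Check P(B,D|A) = P(B|A) P(D|A) for every value combination."""
    for a, b, d in itertools.product((0, 1), repeat=3):
        p_a = marginal(a=a)
        lhs = marginal(a=a, b=b, d=d) / p_a
        rhs = (marginal(a=a, b=b) / p_a) * (marginal(a=a, d=d) / p_a)
        if abs(lhs - rhs) > tol:
            return False
    return True
```

With five binary variables the full joint table needs 31 free parameters, while the factored form above needs only 1 + 1 + 4 + 2 + 2 = 10, which is the efficiency argument made on the earlier slide.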

Applications of Bayesian Networks

• Classification
  ♦ Bayesian network classifiers
  ♦ Probabilistic inference → prediction → cancer classification based on gene expression
• Dependency analysis
  ♦ Bayesian network structure → dependency between components → putative causal relationships

Discretization of Gene Expression Levels

• The local probability distribution for each variable
  ♦ Multinomial distribution with Dirichlet prior.

$$p(\theta_{ij} \mid S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \alpha_{ij2}, \ldots, \alpha_{ijr_i}) = \frac{\Gamma(\alpha_{ij1} + \cdots + \alpha_{ijr_i})}{\prod_{k=1}^{r_i} \Gamma(\alpha_{ijk})} \cdot \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$$

$$\Gamma(1) = 1, \quad \Gamma(x + 1) = x\,\Gamma(x)$$

S^h: the network structure

• Discretization of gene expression levels and drug activities
  ♦ All values → 0 and 1 according to the mean values across the training samples.
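The mean-based discretization can be sketched in a few lines. This is only an illustration under stated assumptions: rows are samples, columns are genes or drugs, values strictly above the training mean map to 1, and ties map to 0 (the slides do not say how ties are handled). The function name `discretize_by_mean` is hypothetical.

```python
import numpy as np


def discretize_by_mean(train, other=None):
    """Binarize columns: 1 if a value exceeds the column mean of `train`, else 0.

    If `other` is given, it is discretized with the thresholds learned from
    `train`, so that new samples are treated consistently.
    """
    train = np.asarray(train, dtype=float)
    thresholds = train.mean(axis=0)          # one threshold per gene or drug
    data = train if other is None else np.asarray(other, dtype=float)
    return (data > thresholds).astype(int)
```

For example, a column with training values 1, 3, 5 has mean 3, so the three samples become 0, 0, 1.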

Selection of Genes and Drugs for Analysis

[Figure: selection pipeline (known drugs, missing values elimination).]

After selection:
- 566 human genes and ESTs
- 82 drugs

Learning Bayesian networks with 648 nodes

Learn the Bayesian Network from Data

• Two approaches to Bayesian network learning
  ♦ Dependency analysis-based approach
  ♦ Optimization-based approach
    - Network score: fitness of the network to training data D.
    - Search the best-scoring network structure
• Scoring metric for the network
  ♦ MDL (minimum description length) score
  ♦ BD (Bayesian Dirichlet) score

BD (Bayesian Dirichlet) Score

• Assumptions
  ♦ Multinomial sample
  ♦ Parameter independence and modularity
  ♦ Dirichlet prior:

$$p(\theta_{ij} \mid S^h) = c \cdot \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$$

  ♦ Complete data

• BD score is calculated as follows:

$$p(D, S) = p(S) \cdot p(D \mid S) = p(S) \cdot \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

  ♦ S: the network structure, D: the training data
  ♦ p(S): prior probability
  ♦ N_ijk: sufficient statistics calculated from D
  ♦ α_ij = Σ_k α_ijk, N_ij = Σ_k N_ijk
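Because the BD score is a product of Gamma-function ratios, it is normally evaluated in log space. The sketch below is one way to do that, assuming the counts N_ijk and hyperparameters α_ijk are supplied as nested lists; the function names and the data layout are illustrative choices, not taken from the slides.

```python
from math import lgamma


def log_bd_family(counts, alphas):
    """Log BD contribution of one node X_i.

    counts[j][k] is N_ijk and alphas[j][k] is alpha_ijk, where j indexes the
    parent configurations and k indexes the states of X_i.
    """
    total = 0.0
    for n_j, a_j in zip(counts, alphas):
        a_ij, n_ij = sum(a_j), sum(n_j)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for n_ijk, a_ijk in zip(n_j, a_j):
            total += lgamma(a_ijk + n_ijk) - lgamma(a_ijk)
    return total


def log_bd_score(families, log_prior=0.0):
    """log p(D, S) = log p(S) + sum over nodes of the family terms.

    `families` is a list of (counts, alphas) pairs, one per node.
    """
    return log_prior + sum(log_bd_family(c, a) for c, a in families)
```

For a binary node without parents, counts (2, 1), and a uniform prior α = (1, 1), the family term is Γ(2)/Γ(5) · Γ(3)/Γ(1) · Γ(2)/Γ(1) = 1/12. The score decomposes over nodes, which is what makes local search moves cheap to evaluate.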

Search Strategy

• Finding the best network structure → NP-hard problem.
  ♦ Exponential time complexity.
  ♦ In general, greedy search or simulated annealing is used.
• General greedy search algorithm is not applicable to the construction of large-scale Bayesian networks with hundreds of nodes.
  ♦ Time and space complexity
• Reduce the search space
  ♦ “Sparse candidate algorithm” by [Friedman et al. 1999]
  ♦ “Local to global search algorithm”
    - Exploit the local search for the Markov blanket of each node to reduce the global search space.
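To illustrate how a restricted search space enters a greedy search, here is a generic hill-climbing sketch over edge additions in which each node may only take parents from a given candidate set. It is not the "local to global" algorithm from the slides and not the sparse candidate algorithm itself; `family_score`, `candidates`, and `max_parents` are hypothetical names, and the score is assumed to decompose over nodes.

```python
def creates_cycle(parents, child, new_parent):
    """True if adding the edge new_parent -> child would close a directed cycle."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:          # child is an ancestor of new_parent
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def greedy_search(nodes, family_score, candidates, max_parents=3):
    """Hill-climb over edge additions, restricted to candidate parents.

    family_score(child, frozenset_of_parents) returns the local score of one
    node; the network score is the sum of these local scores.
    """
    parents = {v: set() for v in nodes}
    while True:
        best_gain, best_edge = 0.0, None
        for child in nodes:
            if len(parents[child]) >= max_parents:
                continue
            base = family_score(child, frozenset(parents[child]))
            for p in candidates[child]:
                if p in parents[child] or creates_cycle(parents, child, p):
                    continue
                gain = family_score(child, frozenset(parents[child] | {p})) - base
                if gain > best_gain:
                    best_gain, best_edge = gain, (p, child)
        if best_edge is None:      # no edge addition improves the score
            return parents
        parents[best_edge[1]].add(best_edge[0])
```

Shrinking `candidates[child]` from all other nodes to a handful of likely neighbours is what cuts the cost per step; a fuller implementation would also consider edge deletions and reversals.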

Local to Global Search Algorithm

[Algorithm listing: not recoverable from the extracted text.]

$$\mathrm{Score}(B, D) = \sum_{i} \mathrm{Score}(X_i \mid \mathbf{Pa}^{B}(X_i), D)$$

Learning Curve

[Figure: BD score versus number of greedy search steps.]

Learning takes about 8 minutes.

Drug-Drug Dependency

• The pyrimidine analogues
  ♦ Aphidicolin-glycinate
  ♦ Floxuridine
  ♦ Cyclocytidine
  ♦ Cytarabine
• Show similar activity patterns across 60 cancer cell lines
• Clustered together in [Scherf et al. 2000].

Figure label: DNA synthesis inhibitor

Experimental Results: Drug-Drug Dependency

• Drug-drug activity correlations

<Part of the learned Bayesian network structure>

- Three drugs, “Aphidicolin-glycinate”, “Floxuridine”, and “Cytarabine”, directly depend on each other.
- “Cyclocytidine” directly depends on “Cytarabine” and vice versa.

Figure label: Confirmation

Gene-Drug Dependency

• Gene “ASNS”
  ♦ Certain malignant cells, including many acute lymphoblastic leukemias (ALL), lack asparagine synthetase (ASNS) → they require exogenous L-asparagine [Scherf et al. 2000].

[Figure: sensitivity to the drug “L-asparaginase” versus expression levels of the gene “ASNS”: high negative correlation.]

Experimental Results: Gene-Drug Dependency

• Gene expression-drug activity correlations

<Part of the learned Bayesian network structure>

- The negative correlation between “ASNS” and “L-asparaginase” is mediated by two other genes.
- The relationships revealed by the Bayesian network are putative and should be verified by biological experiments. → exploratory analysis

Figure labels: Confirmation, Discovery

Gene and Tissue Clustering with Latent Variable Models

CAMDA-2000 Data Set

• Gene expression data for cancer prediction
  ♦ Training data: 38 leukemia samples (27 ALL, 11 AML)
  ♦ Test data: 34 leukemia samples (20 ALL, 14 AML)
  ♦ Datasets contain measurements corresponding to ALL and AML samples from bone marrow and peripheral blood.

Cluster Analysis Using Non-negative Matrix Factorization

• Method
  ♦ Using NMF for class clustering and prediction of gene expression data from acute leukemia patients
• NMF (non-negative matrix factorization)

$$G \approx WH$$

$$G_{i\mu} \approx (WH)_{i\mu} = \sum_{a=1}^{r} W_{ia} H_{a\mu}$$

$$G_{i\mu},\; W_{ia},\; H_{a\mu} \ge 0$$

  G: gene expression data matrix
  W: basis matrix (prototypes)
  H: encoding matrix (in low dimension)

• NMF as a latent variable model

[Figure: latent variables h1, h2, …, hr connected through W to observed variables g1, g2, …, gn.]

$$\langle g \rangle = Wh$$

Clustering Gene Expression Data

[Figure: G (7,129 genes × 38 samples) ≈ W (7,129 genes × 2 factors) × H (2 factors × 38 samples, the encoding); W and H are the unknowns. A second diagram shows the factors H1 and H2 connected through W to the genes g1, g2, g3, g4, …, g7,129.]

• Factors can capture the correlations between the genes using the values of expression level.
• Cluster training samples into 2 groups by NMF
  ♦ Assign each sample to the factor (class) which has higher encoding value.

Learning Procedure

Input: gene expression data matrix G (n × m)
Output: basis matrix W (n × k), encoding matrix H (k × m)
n: data size, m: number of genes, k: number of latent variables

Constraints:

$$W_{ia} \ge 0, \quad \sum_{i} W_{ia} = 1, \quad H_{a\mu} \ge 0$$

Objective function:

$$F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ G_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]$$

Procedure

1. Initialize W, H with random numbers.
2. Update W, H iteratively until max_iteration or some criterion is met:

$$W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{G_{i\mu}}{(WH)_{i\mu}} H_{a\mu}$$

$$W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}}$$

$$H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{G_{i\mu}}{(WH)_{i\mu}}$$
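The update rules above translate almost line by line into NumPy. The following is a minimal sketch, not the authors' implementation: the order in which the updates are applied, the small constant `eps` that guards against division by zero, the fixed iteration count, and the function names are all assumptions made for illustration.

```python
import numpy as np


def nmf_divergence(G, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix G (n x m) as W (n x k) times H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = G.shape
    W = rng.random((n, k)) + eps              # step 1: random initialization
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):                   # step 2: multiplicative updates
        ratio = G / (W @ H + eps)             # G_iu / (WH)_iu
        W *= ratio @ H.T                      # W_ia <- W_ia sum_u ratio_iu H_au
        W /= W.sum(axis=0, keepdims=True)     # W_ia <- W_ia / sum_j W_ja
        ratio = G / (W @ H + eps)
        H *= W.T @ ratio                      # H_au <- H_au sum_i W_ia ratio_iu
    return W, H


def objective(G, W, H, eps=1e-9):
    """F = sum_i sum_u [G_iu log (WH)_iu - (WH)_iu]."""
    WH = W @ H + eps
    return float(np.sum(G * np.log(WH) - WH))
```

With the columns of G as samples, cluster labels follow the rule on the clustering slide: `labels = H.argmax(axis=0)` assigns each sample to the factor with the highest encoding value. In practice one would stop when `objective` changes by less than a tolerance rather than after a fixed number of iterations.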

Learning Curve

[Figure: learning curve of NMF training.]

Experimental Results: Cancer Clustering

[Figure: clustering result for the training samples.]

Accuracy: 0 to 1 errors for the training data set

Diagnosis Using NMF

• For each test sample g, estimate the encoding vector h that best approximates the sample.
  ♦ W is the basis matrix computed during training (fixed).
  ♦ As in training, assign each sample to the factor (class) which has the highest encoding value.
• Accuracy: 1 to 2 errors for the test data set

[Figure: g (7,129 genes) ≈ W (7,129 genes × 2 factors) × h, where h is the unknown. A second diagram shows h1 and h2 connected through W to the genes g1, g2, g3, g4, …, g7,129.]
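Estimating h for a new sample amounts to running only the encoding update while W stays fixed. The slide does not give the formula, so the sketch below assumes the same multiplicative update as in training, divided by the column sums of W so that it also works when W is not normalized; `encode_sample` and `diagnose` are hypothetical names.

```python
import numpy as np


def encode_sample(W, g, n_iter=200, eps=1e-9):
    """Estimate a non-negative h with g ~ W h while keeping W fixed."""
    W = np.asarray(W, dtype=float)
    g = np.asarray(g, dtype=float)
    h = np.full(W.shape[1], 1.0)
    col_sums = W.sum(axis=0)                 # equals 1 if W was normalized in training
    for _ in range(n_iter):
        ratio = g / (W @ h + eps)            # g_i / (Wh)_i
        h *= (W.T @ ratio) / (col_sums + eps)
    return h


def diagnose(W, g):
    """Assign the sample to the factor with the highest encoding value."""
    return int(np.argmax(encode_sample(W, g)))
```

Which factor index corresponds to ALL and which to AML has to be read off the training samples, since NMF itself does not label its factors.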

Clustering Using Generative Topographic Mapping

• GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space.

[Figure: a grid in the <Latent space> (axes x1, x2) is mapped by y(x;W) into the <Data space> (axes t1, t2, t3).]

Learning Generative Topographic Mapping

• Learning algorithm
  - Generate the grid of latent points.
  - Generate the grid of basis function centers.
  - Compute the matrix of basis function activations Φ.
  - Initialize weights W in Y = ΦW and the noise variance β.
  - Compute Δn,k = ||tn − ΦkW||² = ||tn − yk(x,W)||² for each n, k
  - Repeat
    ♦ Compute the responsibility matrix R using Δ and β. [E-step]
    ♦ Compute G = RTR
    ♦ Update W by ΦᵀGΦW = ΦᵀRT [M-step]
    ♦ Compute Δ = ||t − ΦW||²
    ♦ Update β
  - Until convergence
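The loop above can be sketched compactly in NumPy. This is an illustration under several assumptions that are not on the slide: a 2-D latent grid on [-1, 1]², Gaussian basis functions of width `sigma` plus a bias column, G taken as the diagonal matrix with entries Σ_n R_kn (the usual GTM formulation), β treated as the inverse noise variance, and a small ridge term `reg` to keep the linear solve stable.

```python
import numpy as np


def gtm_fit(T, grid=5, n_basis=3, sigma=1.0, n_iter=30, reg=1e-3, seed=0):
    """Fit a GTM with a 2-D latent grid to data T (N x D)."""
    rng = np.random.default_rng(seed)
    N, D = T.shape
    # Grid of latent points (K x 2) and grid of basis function centers (M x 2).
    ax = np.linspace(-1.0, 1.0, grid)
    X = np.array([(a, b) for a in ax for b in ax])
    cx = np.linspace(-1.0, 1.0, n_basis)
    C = np.array([(a, b) for a in cx for b in cx])
    # Basis activations Phi (K x (M + 1)): Gaussian RBFs plus a bias column.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    Phi = np.hstack([np.exp(-d2 / (2.0 * sigma ** 2)), np.ones((len(X), 1))])
    # Initialize W (small random weights, bias at the data mean) and beta.
    W = rng.normal(scale=0.1, size=(Phi.shape[1], D))
    W[-1] = T.mean(axis=0)
    beta = 1.0 / T.var()
    for _ in range(n_iter):
        Y = Phi @ W                                           # K x D centers
        delta = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(-1)  # K x N distances
        # E-step: responsibilities R (K x N), computed stably in log space.
        log_p = -0.5 * beta * delta
        p = np.exp(log_p - log_p.max(axis=0))
        R = p / p.sum(axis=0)
        # M-step: solve Phi^T G Phi W = Phi^T R T with G = diag(sum_n R_kn).
        G = np.diag(R.sum(axis=1))
        A = Phi.T @ G @ Phi + reg * np.eye(Phi.shape[1])
        W = np.linalg.solve(A, Phi.T @ R @ T)
        # Recompute the distances and update beta.
        delta = (((Phi @ W)[:, None, :] - T[None, :, :]) ** 2).sum(-1)
        beta = N * D / np.sum(R * delta)
    return X, W, beta, R
```

The returned responsibilities give the latent-space projections used for visualization: `R.T @ X` is the posterior mean of each data point and `X[R.argmax(axis=0)]` is its posterior mode.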


GTM: Visualization

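• Posterior distribution in latent space given a data point t:

$$p(x_k \mid t) = \frac{p(t \mid x_k, W, \beta)}{\sum_{k'} p(t \mid x_{k'}, W, \beta)}$$

• For a whole set of data: for each t, plot in the latent space
  ♦ Posterior mode: $x^{\mathrm{mode}}(t) = \arg\max_{x_k} p(x_k \mid t)$
  ♦ Posterior mean: $x^{\mathrm{mean}}(t) = \sum_{k} x_k \, p(x_k \mid t)$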

GTM: Clustering Experiment

• Gene Selection
  ♦ Select about 50 genes out of 7,129 based on the three test scores of cancer diagnosis.
    - Correlation metric (similar to a t-test)
    - Wilcoxon test scores (a nonparametric counterpart of the t-test)
    - Median test scores (a nonparametric counterpart of the t-test)
• Clustering & Visualization
  ♦ After learning a model, genes are plotted in the latent space.
  ♦ With the mapping in the latent space, clusters can be identified.
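As an illustration of score-based gene selection, the sketch below ranks genes by a signal-to-noise statistic, (mean₁ − mean₀) / (std₁ + std₀). That this is the "correlation metric" (or the P-metric mentioned on a later slide) is an assumption; it is the statistic used by Golub et al. (1999) on the same leukemia data. The function name, the label coding (1 and 0 for the two classes), and the tiny constant in the denominator are also illustrative choices.

```python
import numpy as np


def select_genes(expr, labels, n_select=50):
    """Rank genes by (mean_1 - mean_0) / (std_1 + std_0) and keep the extremes.

    expr is a samples x genes matrix; labels holds 1 or 0 for each sample.
    Returns the indices of the selected genes and the score of every gene.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    a, b = expr[labels == 1], expr[labels == 0]
    score = (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)
    order = np.argsort(-np.abs(score))         # largest |score| first
    return order[:n_select], score
```

Genes with large positive scores are highly expressed in class 1 and genes with large negative scores in class 0, so both ends of the ranking are informative.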

List of Genes Selected

GTM: Learning Curve

GTM: Clustering Result

- Genes with high expression levels in case of ALL (large P-metric value)
- Genes with high expression levels in case of AML (large negative P-metric value)

Experimental Results: Clusters Found by GTM

• Three cell cycle-regulated clusters found by GTM [Shin et al. 2000]

Phase | Cluster | Cluster center | No. of train data / no. in cluster | Correct no. / test data | Overall mean expression levels (Cln/b) of known genes
G1 | c1 | (-0.111 0.333) | 35 / 18 | 10 / 16 (62%) | (.894 .907 -.766 -.479)
G1 | c2 | (-0.111 0.111) | / 7 | 0 / 16 |
G2/M | c1 | (0.111 0.333) | 10 / 5 | 0 / 5 | (-.616 -1.01 1.832 1.596)
G2/M | c2 | (0.111 0.111) | / 3 | 3 / 5 (80%) |
M/G1 | c1 | (0.111 0.333) | 13 / 7 | 1 / 6 | (-.171 -.573 .091 .311)
M/G1 | c2 | (-0.111 -0.111) | / 2 | 0 / 6 |
M/G1 | c3 | (0.323 0.1) | / 2 | 0 / 6 |
S | | (0.111 -0.333) | 5 / 5 | 5 / 5 (100%) | (1.075 1.482 -.233 -.375)
S/G2 | | | 5 / | 1 / 2 | (.148 .184 -.367 -.044)

Experimental Results: Comparison with Other Methods

• Comparison of prototype expression levels

Phase | Cluster | No. of selected genes (GTM) | Mean expression levels by GTM | No. of selected genes by Spellman | Mean expression levels by Spellman
G1 | c1 | 122 | (.92 .74 -.62 -.33) | 300 | (.66 .49 -.55 -.33)
G1 | c2 | 74 | (.79 .82 -.48 -.34) | |
G2/M | c1 | 33 | (-.59 -.96 1.34 1.29) | 195 | (-.32 -.62 .49 .54)
G2/M | c2 | 60 | (.08 -.30 .51 .57) | |
M/G1 | c1 | 120 | (.82 .65 -.65 -.38) | 113 | (-.21 -.61 -.04 .07)
M/G1 | c2 | 34 | (-.04 -.37 -.01 -.11) | |
M/G1 | c3 | 10 | (.32 .29 -.3 .05) | |
S | | 25 | (.84 .81 -.42 -.33) | 71 | (.46 .47 -.43 -.18)
S/G2 | | 92 | (.13 -.06 -.1 .01) | 121 | (.13 .05 -.16 .03)
Total | | 570 | | 800 |

Conclusion

Summary and Future Work

• A class of nonstandard learning algorithms, i.e. probabilistic graphical models, is presented for microarray data analysis:
  ♦ Bayesian networks
  ♦ Non-negative matrix factorization
  ♦ Generative topographic mapping
• Probabilistic graphical models are useful for biological data mining:
  ♦ Comprehensibility
  ♦ Generative
  ♦ Soft inference (clustering)
  ♦ Dependency analysis
• Future work includes:
  ♦ Construction of larger-scale networks
  ♦ Acceleration of learning processes
  ♦ Handling data sparseness and missing data problems

References

• [Chang et al. 2001] Chang, J.-H., Hwang, K.-B., and Zhang, B.-T., Analysis of gene expression profiles and drug activity patterns for the molecular pharmacology of cancer, In Proceedings of CAMDA’01, 2001 (to appear).
• [Friedman et al. 1999] Friedman, N., Nachman, I., and Pe’er, D., Learning Bayesian network structure from massive datasets: the “sparse candidate” algorithm, In Proceedings of UAI’99, 1999.
• [Hwang et al. 2001] Hwang, K.-B., Cho, D.-Y., Park, S.-W., Kim, S.-D., and Zhang, B.-T., Applying machine learning techniques to analysis of gene expression data: cancer diagnosis, Methods of Microarray Data Analysis, 167-182, 2001.
• [Scherf et al. 2000] Scherf, U. et al., A gene expression database for the molecular pharmacology of cancer, Nature Genetics, 24: 236-244, 2000.
• [Shin et al. 2000] Shin, H.-J., Chang, J.-H., Yang, J.-S., Zhang, B.-T., and Augh, S.-J., Probabilistic models for clustering cell cycle-regulated genes in the Yeast, In Proceedings of CAMDA’00, 2000.

Acknowledgements

CAMDA-2000 & CAMDA-2001

• Sirk-June Augh (Molecular Biology)
• Jeong-Ho Chang (Helmholtz Machines, NMF, PLSA)
• Dong-Yeon Cho (Bayesian Evolution, Neural Trees)
• Kyu-Baek Hwang (Bayesian Networks)
• Sung-Dong Kim (Hierarchical Clustering)
• Sang-Wook Park (RBF Networks)
• Hyung-Joo Shin (Probabilistic LSA)
• Jin-San Yang (Statistics, GTM, SOM)

• Ho-Jean Chung (Statistical Learning)
• Seung-Woo Chung (ICA, PCA)
• Jae-Hong Eom (Hidden Markov Models)
• Tae-Jin Jeong (Reinforcement Learning)
• Je-Gun Joung (Evolution, Neural Trees)
• Chul-Joo Kang (Molecular Biology)
• Jun-Shik Kim (Statistical Physics)
• Sun Kim (Evolutionary Algorithms)
• Yoo-Hwan Kim (ICA, Naïve Bayes, Boosting)
• In-Hee Lee (Evolutionary Computation)
• Jae-Won Lee (Reinforcement Learning)
• Jong-Woo Lee (Latent Variable Models)
• Si-Eun Lee (Bayesian Evolution)
• Seung-Joon Lee (Reinforcement Learning)
• Jong-Yoon Lim (SOM, Graphical Models)
• Hyun-Koo Moon (HMM, Naïve Bayes)
• Jang-Min O (Support Vector Machines, Boosting)
• Seong-Bae Park (Decision Trees)
• Hyung-Joo Shin (Probabilistic LSA)
• Soo-Yong Shin (Evolution, Helmholtz Machines)

More information at

http://cbit.snu.ac.kr/
http://bi.snu.ac.kr/