
Computer Science

TRICLUSTER: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data

Mohammed J. Zaki & Lizhuang Zhao

Department of Computer Science,

Rensselaer Polytechnic Institute (RPI), Troy, NY

{zhaol2, zaki}@cs.rpi.edu

Microarray Data

An essential source of information about gene expression within a cell.

Typically 2D: genes x samples (or genes x time), measuring the expression level of genes in different samples.

Labeled samples: classification (cancer vs. non-cancer).

Non-labeled samples: clustering (biclusters).

Goal: identify the expression patterns, which provide clues to the gene regulatory networks within a cell.

Why Biclustering?

[Figure: two 5 x 5 expression matrices over genes g1 to g5 and samples s1 to s5. One highlights a full-space cluster, in which genes (g2, g4, g5) are similar across all five samples; the other highlights the bicluster (g2, g4, g5) × (s2, s3, s5).]

In a bicluster, some genes are similarly expressed in some samples.

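Different "Homogeneity" or Similarity Criteria

[Figure: example 3 x 3 clusters, arranged from constant patterns to more general ones.]

Constant (Col, Row, All). Each column constant:

    1 2 5
    1 2 5
    1 2 5

Each row constant:

    1 1 1
    2 2 2
    5 5 5

All values equal:

    2 2 2
    2 2 2
    2 2 2

Scaling (Scale=1.4 from the first column to the second):

    1.0 1.4 2.0
    2.0 2.8 4.0
    2.5 3.5 5.0

Shifting (Shift=0.4 from the first column to the second):

    1.0 1.4 2.0
    2.0 2.4 3.0
    2.5 2.9 3.5

Order preserving (Order: 2 1 3, i.e. in every row the second column is smallest and the third is largest):

    4 1 7
    3 2 5
    6 3 8

Note: small noise is allowed in all expression values.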
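As a quick check of the example matrices on this slide, the following minimal Python sketch tests the three pattern types column by column. The helper names (`column_ratios`, `column_diffs`, `nearly_constant`, `column_order`) and the tolerance are illustrative choices, not part of TriCluster.

```python
def column_ratios(matrix, a, b):
    """Per-row ratio of column b to column a."""
    return [row[b] / row[a] for row in matrix]


def column_diffs(matrix, a, b):
    """Per-row difference between column b and column a."""
    return [row[b] - row[a] for row in matrix]


def nearly_constant(values, tol=1e-9):
    """True when all values agree up to a small tolerance."""
    return max(values) - min(values) <= tol


def column_order(row):
    """1-based column numbers sorted by increasing value."""
    return [j + 1 for j in sorted(range(len(row)), key=lambda j: row[j])]


scaling = [[1.0, 1.4, 2.0], [2.0, 2.8, 4.0], [2.5, 3.5, 5.0]]
shifting = [[1.0, 1.4, 2.0], [2.0, 2.4, 3.0], [2.5, 2.9, 3.5]]
ordered = [[4, 1, 7], [3, 2, 5], [6, 3, 8]]

# Scaling: the ratio between two columns is the same in every row.
print(nearly_constant(column_ratios(scaling, 0, 1)))
# Shifting: the difference between two columns is the same in every row.
print(nearly_constant(column_diffs(shifting, 0, 1)))
# Order preserving: every row ranks its columns the same way.
print({tuple(column_order(row)) for row in ordered})
```

The shifting matrix fails the ratio test and the order-preserving matrix fails both, which is why the slide treats them as progressively more general criteria.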

Why TriCluster?

Typical microarray data is 2D (gene x sample).

Temporal expression is a very important tool: how does gene expression evolve in time? Find clusters over genes x samples x time.

Spatial expression is also of interest: how does gene expression differ in space (e.g., different regions of the mouse brain)? Find clusters over genes x samples x space.

Combine temporal and spatial expression: find clusters over genes x time x space, etc.

There is an emerging need to mine 3D data.

TriCluster: Our Contributions

First algorithm to mine tri-clusters in 3D microarray data.

Complete and deterministic.

Mines maximal clusters satisfying given homogeneity criteria: constant (column, row, all), scaling and shifting.

Clusters can be overlapping; optionally delete/merge clusters having large overlap.

Proposes a set of metrics for cluster evaluation.

Uses Gene Ontology (GO) to assess biological significance.

Definitions

G is a set of genes {g0, g1, …, gn-1}.

S is a set of samples {s0, s1, …, sm-1}.

T is a set of time courses {t0, t1, …, tl-1}.

3D real-valued dataset D = {dijk} = G x S x T, where dijk is the expression value of gene gi in sample sj at time tk.

A triCluster is a maximal submatrix C = X x Y x Z = {cijk} of D, with X ⊆ G, Y ⊆ S, Z ⊆ T, that satisfies the given homogeneity conditions.

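Scaling triCluster Example

[Figure: a 3 x 3 x 3 scaling triCluster shown as three time slices.]

    1  3  4        4 12 16        2  6  8
    2  6  8        8 24 32        4 12 16
    5 15 20       20 60 80       10 30 40

Ratios along the three dimensions (genes, samples, time): within a slice the rows scale as 1 : 2 : 5 and the columns as 1 : 3 : 4; across the slices the values scale as 1 : 4 : 2.

Note: small noise is allowed.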
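A noise-free scaling triCluster like this one is an outer product of three ratio vectors. The short Python sketch below rebuilds the three slices from the ratios read off the slide; the variable names are illustrative only.

```python
row_ratio = [1, 2, 5]      # scaling down the rows of each slice
col_ratio = [1, 3, 4]      # scaling across the columns of each slice
slice_ratio = [1, 4, 2]    # scaling from one time slice to the next

# cube[k][i][j] = slice_ratio[k] * row_ratio[i] * col_ratio[j]
cube = [[[t * r * c for c in col_ratio] for r in row_ratio]
        for t in slice_ratio]

for t, time_slice in zip(slice_ratio, cube):
    print(f"slice factor {t}: {time_slice}")
```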

TriCluster Concepts

C = X x Y x Z = {cijk} is a triCluster iff:

C is maximal (there is no valid C' ⊃ C).

C has sufficient size: |X| ≥ mg, |Y| ≥ ms, |Z| ≥ mt.

The noise/error threshold ε is satisfied for any C22, where C22 = [cia cib; cja cjb] is an arbitrary 2x2 submatrix of C. Let ri = |cia/cib| and rj = |cja/cjb|; then max(ri, rj) / min(ri, rj) – 1 ≤ ε.

The range threshold δa is satisfied for each dimension a: let δ = |cijk – cxyz|; if j=y and k=z, then δ ≤ δg (δs and δt are defined similarly).
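The 2x2 ratio condition can be illustrated with a brute-force check over one 2D slice. This is a minimal sketch, not the TriCluster mining procedure: the function name `is_coherent` is hypothetical, all values are assumed non-zero, and the test enumerates every pair of rows and every pair of columns.

```python
from itertools import combinations


def is_coherent(matrix, eps):
    """Check max(ri, rj) / min(ri, rj) - 1 <= eps for every 2x2 submatrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    for i, j in combinations(range(n_rows), 2):
        for a, b in combinations(range(n_cols), 2):
            ri = abs(matrix[i][a] / matrix[i][b])
            rj = abs(matrix[j][a] / matrix[j][b])
            if max(ri, rj) / min(ri, rj) - 1 > eps:
                return False
    return True


scaling = [[1.0, 1.4, 2.0], [2.0, 2.8, 4.0], [2.5, 3.5, 5.0]]
noisy = [[1.0, 1.4], [2.0, 2.83]]  # roughly 1% off a perfect scaling pattern

print(is_coherent(scaling, eps=0.01))
print(is_coherent(noisy, eps=0.01), is_coherent(noisy, eps=0.05))
```

Enumerating all 2x2 submatrices is quadratic in both dimensions, which is one reason the algorithm works with ratio ranges and cliques instead of testing candidate clusters directly.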

TriCluster Flexibility

The cluster definition is symmetric: any ordering of dimensions is allowed, since A/C≈B/D ↔ A/B≈C/D ↔ AD≈BC. [Figure: the 2x2 submatrix [A C; B D] is the transpose of [A B; C D].]

Several types of clusters can be mined. Typically ε ≈ 0 to allow small noise/error.

Approximately constant cluster: δg ≈ 0 and δs ≈ 0 and δt ≈ 0.

Approximately constant along a single dimension: δg ≈ 0 or δs ≈ 0 or δt ≈ 0.

Approximately constant along two dimensions: (δg ≈ 0 and δs ≈ 0) or (δg ≈ 0 and δt ≈ 0) or (δs ≈ 0 and δt ≈ 0).

Scaling cluster: δg, δs and δt are unconstrained.

Shifting cluster: if e^C is a scaling cluster, then C is a shifting cluster.
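The link between shifting and scaling clusters follows from e^(x + d) = e^x · e^d: exponentiating every cell turns a constant shift into a constant scale factor. A small Python sketch, using the shifting matrix from the similarity-criteria slide (shift 0.4 between the first two columns):

```python
import math

shifting = [[1.0, 1.4, 2.0], [2.0, 2.4, 3.0], [2.5, 2.9, 3.5]]

# Exponentiate every cell of the shifting cluster.
exponentiated = [[math.exp(v) for v in row] for row in shifting]

# The column-to-column ratio is now the same in every row: e^0.4.
ratios = [row[1] / row[0] for row in exponentiated]
print(ratios, math.exp(0.4))
```

Under this view a scaling-cluster miner can also find shifting clusters by running on the exponentiated data.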

TriCluster Algorithm

Compute maximal biclusters on G x S for each time slice t ∈ T: construct the range multigraph, then find maximal cliques.

Compute triclusters from the biclusters: construct a new multigraph (T x biclusters), then find maximal cliques.

Merge/prune overlapping clusters.

Maximal Biclusters

Mine each G x S time slice for maximal biclusters.

For each pair of samples, get the valid ratio ranges (within ε) and their gene-sets.

Construct a range multigraph and mine its maximal cliques.

Each clique/cluster can contribute to some valid tricluster.

Valid Ratio Ranges: Each Column Pair

Take the ratio of s0 and s6 and construct valid ranges: a range contains at least mg values within ε (the noise threshold).

With ε=0.05 and mg=3: 3.0×(1+ε)=3.15, so range = [3, 3.15].

Other ranges = [3.3, 3.465], and so on.

Construct gene-sets: [3, 3.15] has genes {g1, g4, g8}.

[Figure: range example, showing the original data and the data after row/column permutation.]
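One way to picture the range construction is a sliding window over the sorted ratios of a column pair. The sketch below is a simplified illustration under stated assumptions, not the authors' exact procedure: `valid_ranges` is a hypothetical name, ratios are assumed positive, windows always start at an observed ratio, and a window is kept only if it reaches further than the previously kept one. The expression values are made up so that genes g1, g4 and g8 fall in [3, 3.15], as on the slide.

```python
def valid_ranges(col_a, col_b, eps, mg):
    """Return [((low, high), genes)] for ratio windows holding >= mg genes."""
    ratios = sorted((a / b, g) for g, (a, b) in enumerate(zip(col_a, col_b)))
    ranges = []
    last_end = -1  # position of the last ratio covered by a kept window
    for start, (low, _) in enumerate(ratios):
        high = low * (1 + eps)
        end = start
        while end + 1 < len(ratios) and ratios[end + 1][0] <= high:
            end += 1
        if end - start + 1 >= mg and end > last_end:
            genes = sorted(g for _, g in ratios[start:end + 1])
            ranges.append(((low, high), genes))
            last_end = end
    return ranges


# Hypothetical expression values for two sample columns and ten genes.
s0 = [2.0, 3.0, 5.0, 7.0, 3.1, 9.0, 11.0, 13.0, 3.12, 15.0]
s6 = [1.0] * 10

for (low, high), genes in valid_ranges(s0, s6, eps=0.05, mg=3):
    print(f"range starting at {low}: genes {genes}")
```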

Range Multigraph: Pair of Samples

Construct valid ratios and gene-sets for s1/s4:

Ratio = 1/1, gene-set = {g2, g6, g0, g9, g7}.

Ratio = 5/4, gene-set = {g4, g8, g1}.

Construct ratios/gene-sets for the other pairs.

[Figure: the resulting multigraph.]

Range Multigraph: Complete

Construct ratios/gene-sets for all sample pairs

Maximal Clique Mining

Perform a recursive depth-first search.

Maintain the valid gene-sets for each node, and intersect the gene-sets with each outgoing edge: {g2, g6, g0, g9, g7} ∩ {g2, g6, g0, g9} = {g2, g6, g0, g9}.

Prune if various criteria are not met (size, dimension range).

[Figure: range multigraph over samples s0 to s6.]
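The depth-first search with gene-set intersection can be sketched as follows. This is a simplified illustration, not the authors' implementation: the multigraph is assumed to be a dictionary `edges[(a, b)]` listing one gene-set per valid ratio range for each sample pair a < b, only the size thresholds mg and ms are used for pruning, and maximality is enforced by a final filter rather than during the search. Sample and gene identifiers are plain integers, and the edge data is made up around the gene-sets shown on the slides.

```python
def mine_biclusters(samples, edges, mg, ms):
    """Return maximal (gene-set, samples) pairs found by DFS over sample cliques."""
    found = set()

    def extend(clique, genes):
        if len(clique) >= ms:
            found.add((genes, tuple(clique)))
        for s in samples:
            if s <= clique[-1]:
                continue  # only grow towards higher-numbered samples
            # Intersect with one edge gene-set per existing clique member.
            candidates = {genes}
            for member in clique:
                candidates = {c & e for c in candidates
                              for e in edges.get((member, s), [])}
            for c in candidates:
                if len(c) >= mg:  # prune gene-sets that became too small
                    extend(clique + [s], c)

    all_genes = frozenset().union(*(e for es in edges.values() for e in es))
    for s in samples:
        extend([s], all_genes)

    # Keep only clusters not contained in another found cluster.
    return [(g, s) for g, s in found
            if not any((g, s) != (g2, s2) and g <= g2 and set(s) <= set(s2)
                       for g2, s2 in found)]


edges = {
    (1, 4): [frozenset({0, 2, 6, 7, 9}), frozenset({1, 4, 8})],
    (1, 6): [frozenset({0, 2, 6, 9})],
    (4, 6): [frozenset({0, 2, 6, 9}), frozenset({1, 4, 8})],
}
print(mine_biclusters([1, 4, 6], edges, mg=3, ms=3))
```

Because each sample pair can carry several edges, the search branches on every combination of gene-sets, which is what makes the structure a multigraph rather than a simple graph.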

Mine triClusters

Let Bt be the set of maximal biclusters for time slice t.

Construct a new multigraph: each time point is a vertex, and each pair of highly overlapping biclusters (gene-set, samples) forms an edge between time points ti and tj.

Call maximal clique mining to obtain the maximal triclusters.
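The combination step can be pictured as intersecting biclusters across time slices while the intersection stays large enough. The sketch below covers only that set-intersection part and deliberately omits the check that expression ratios are also coherent along the time dimension, so it is an approximation of the idea rather than the full method. The function name, the input layout (`biclusters_by_time[t]` as a list of `(genes, samples)` frozenset pairs) and the toy data are assumptions; mt is assumed to be at least 1.

```python
def mine_triclusters(biclusters_by_time, mg, ms, mt):
    """Intersect biclusters over increasing time points; keep maximal results."""
    times = sorted(biclusters_by_time)
    found = set()

    def extend(chosen, genes, samples, next_index):
        if len(chosen) >= mt:
            found.add((genes, samples, tuple(chosen)))
        for idx in range(next_index, len(times)):
            t = times[idx]
            for g, s in biclusters_by_time[t]:
                new_g = g if genes is None else genes & g
                new_s = s if samples is None else samples & s
                if len(new_g) >= mg and len(new_s) >= ms:
                    extend(chosen + [t], new_g, new_s, idx + 1)

    extend([], None, None, 0)

    def contains(big, small):
        return (small[0] <= big[0] and small[1] <= big[1]
                and set(small[2]) <= set(big[2]))

    return [c for c in found
            if not any(c != other and contains(other, c) for other in found)]


biclusters_by_time = {
    0: [(frozenset({1, 2, 3, 4}), frozenset({0, 1, 2}))],
    1: [(frozenset({1, 2, 3}), frozenset({0, 1, 2, 3}))],
    2: [(frozenset({2, 3, 4}), frozenset({5, 6}))],
}
print(mine_triclusters(biclusters_by_time, mg=3, ms=2, mt=2))
```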

Constructing triClusters

[Figure sequence over three slides: triclusters constructed across time points ti, tj and tk.]

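Prune and Merge

[Figure: overlapping clusters A and B; a cluster B overlapping several clusters Ai and Aj; the merged region of A and B.]

Prune B if |LB-A| / |LB| is below a threshold (B is mostly covered by A).

Prune B if |LB-∪Ai| / |LB| is below a threshold (B is mostly covered by several clusters Ai, Aj, and so on).

Merge A and B if |L(A+B)-A-B| / |L(A+B)| is below a threshold.

Cluster span: LC = {(i,j,k) | gi, sj, tk ∈ C}.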

LAB = LA LB

LA-B = LA – LB

LA+B = (LA – LB) (LB – LA) (LA LB)
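The first pruning rule can be illustrated by materializing cluster spans as sets of (gene, sample, time) cells. This is a minimal sketch under assumptions: clusters are given as triples of index sets, the threshold value is a hypothetical user parameter, and only the single-cluster rule (B mostly covered by A) is shown.

```python
from itertools import product


def span(cluster):
    """All (gene, sample, time) cells belonging to a cluster."""
    genes, samples, times = cluster
    return set(product(genes, samples, times))


def uncovered_fraction(b, a):
    """|L_B - L_A| / |L_B|: the share of B's cells not covered by A."""
    span_b = span(b)
    return len(span_b - span(a)) / len(span_b)


def should_prune(b, a, threshold):
    """Prune B when too little of it lies outside A."""
    return uncovered_fraction(b, a) < threshold


A = ({0, 1, 2, 3}, {0, 1, 2}, {0, 1})
B = ({0, 1, 2}, {0, 1, 2}, {0, 1, 2})

print(uncovered_fraction(B, A))       # 9 of B's 27 cells lie outside A
print(should_prune(B, A, threshold=0.5))
```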

Metrics for Measuring Clustering Quality

NumClusters: number of clusters.

Span: Span(X×Y×Z) = |X|×|Y|×|Z|.

ElementSum: sum of all cluster spans (overlapping cells are counted multiple times).

Coverage: union of all cluster spans (each cell is counted once).

Overlap: (ElementSum - Coverage) / Coverage.

We want high coverage with small overlap.
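These metrics are straightforward to compute once each cluster is expanded into its set of cells. A minimal Python sketch, assuming clusters are given as (genes, samples, times) triples of index sets; the function names and the two toy clusters are illustrative.

```python
from itertools import product


def span(cluster):
    """All (gene, sample, time) cells belonging to a cluster."""
    genes, samples, times = cluster
    return set(product(genes, samples, times))


def clustering_metrics(clusters):
    spans = [span(c) for c in clusters]
    element_sum = sum(len(s) for s in spans)   # shared cells counted repeatedly
    coverage = len(set().union(*spans))        # each cell counted once
    overlap = (element_sum - coverage) / coverage if coverage else 0.0
    return {
        "NumClusters": len(clusters),
        "ElementSum": element_sum,
        "Coverage": coverage,
        "Overlap": overlap,
    }


# Two 2x2x2 clusters that share the four cells of gene 1.
clusters = [({0, 1}, {0, 1}, {0, 1}), ({1, 2}, {0, 1}, {0, 1})]
print(clustering_metrics(clusters))
```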

Synthetic Data Generation

Experiments: 1.4 GHz, 448 MB, Linux/VMware.

Synthetic data for parameter evaluation. Input parameters:

|G|=4000, |S|=30, |T|=20.

Number of clusters to embed = 10.

Overlap among clusters = 20%.

Noise for expression values = 3%.

Cluster size = 150x6x4 (with some variation).

Generate clusters with values within some range, fill the rest of the cells with random noise, then do random permutations along each dimension.

We vary one parameter and keep the others fixed.
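The generation steps can be sketched at a much smaller scale. The code below is an assumption-laden illustration rather than the authors' generator: it embeds a single scaling cluster (no overlap between clusters), draws the background uniformly from [1, 10], applies multiplicative noise of up to 3% per cell, and returns where the cluster ended up after the permutations. All names and value ranges are hypothetical.

```python
import random


def generate(n_genes, n_samples, n_times, cluster_shape, noise=0.03, seed=0):
    rng = random.Random(seed)
    # Background: random values in every cell.
    data = [[[rng.uniform(1.0, 10.0) for _ in range(n_times)]
             for _ in range(n_samples)]
            for _ in range(n_genes)]
    # Embed one scaling cluster in the leading corner, with per-cell noise.
    cg, cs, ct = cluster_shape
    gene_f = [rng.uniform(1.0, 3.0) for _ in range(cg)]
    sample_f = [rng.uniform(1.0, 3.0) for _ in range(cs)]
    time_f = [rng.uniform(1.0, 3.0) for _ in range(ct)]
    for i in range(cg):
        for j in range(cs):
            for k in range(ct):
                jitter = 1 + rng.uniform(-noise, noise)
                data[i][j][k] = gene_f[i] * sample_f[j] * time_f[k] * jitter
    # Random permutation along each dimension.
    gp = rng.sample(range(n_genes), n_genes)
    sp = rng.sample(range(n_samples), n_samples)
    tp = rng.sample(range(n_times), n_times)
    permuted = [[[data[g][s][t] for t in tp] for s in sp] for g in gp]
    # New positions of the embedded cluster's genes, samples and time points.
    where = ([gp.index(i) for i in range(cg)],
             [sp.index(j) for j in range(cs)],
             [tp.index(k) for k in range(ct)])
    return permuted, where


data, cluster = generate(40, 6, 5, cluster_shape=(5, 3, 2))
print(cluster)
```

Returning the permuted positions makes it possible to compare the clusters a miner reports against the embedded ground truth.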

Results on Synthetic Datasets

[Figure: six plots of running time (sec) against the number of genes, the number of samples, the number of time points, the number of clusters, overlap (%) and variation (%).]

Results on Yeast Cell Cycle Dataset

http://genome-www.stanford.edu/cellcycle

Elutriation experiment: 7679 genes, 14 time points (0 to 390 min at 30 min gaps).

No real samples: use the raw expression values of 13 attributes as samples (Cyc3, Cyc5, ratios, etc.), so GxSxT = 7679 x 13 x 14.

Note: actual 3D data will become publicly available soon (e.g., Mouse Brain Atlas: genes x space x time).

Run TriCluster with mg=50, ms=4, mt=5, ε=0.03: found 5 clusters in 28 s, overlap=0, coverage=6250.

A 2D view of cluster C0 (51x4x5) is shown next.

2D Views of Cluster C0 on Yeast Data

[Figure: three panels titled Sample Curves, Time Curves and Gene Curves. Y-axes: expression values; x-axes: time points, genes, genes. Legend entries: time points t=120, 210, 270, 330, 390 and samples s=CH2I, CH2D, CH2IN, CH2DN.]

Results on Yeast Cell Cycle Dataset: Gene Ontology

Significant (p-value < 0.01) shared Gene Ontology (GO) terms (Process, Function, Cellular Location) for genes in different clusters.

C0 (51 genes)

Process: ubiquitin cycle (n=3, p=0.00346), protein polyubiquitination (n=2, p=0.00796), carbohydrate biosynthesis (n=3, p=0.00946).

C1 (52 genes)

Process: G1/S transition of mitotic cell cycle (n=3, p=0.00468), mRNA polyadenylylation (n=2, p=0.00826).

Function: protein phosphatase regulator activity (n=2, p=0.00397), phosphatase regulator activity (n=2, p=0.00397).

C2 (57 genes)

Process: lipid transport (n=2, p=0.0089).

Function: oxidoreductase activity (n=7, p=0.00239), lipid transporter activity (n=2, p=0.00627), antioxidant activity (n=2, p=0.00797).

Cellular Location: cytoplasm (n=41, p=0.00052), microsome (n=2, p=0.00627), vesicular fraction (n=2, p=0.00627), microbody (n=3, p=0.00929), peroxisome (n=3, p=0.00929).

C3 (97 genes)

Process: physiological process (n=76, p=0.0017), organelle organization and biogenesis (n=15, p=0.00173), localization (n=21, p=0.00537).

Function: MAP kinase activity (n=2, p=0.00209), deaminase activity (n=2, p=0.00804), hydrolase activity, acting on carbon-nitrogen, but not peptide, bonds (n=4, p=0.00918), receptor signaling protein serine/threonine kinase activity (n=2, p=0.00964).

Cellular Location: membrane (n=29, p=9.36e-06), cell (n=86, p=0.0003), endoplasmic reticulum (n=13, p=0.00112), vacuolar membrane (n=6, p=0.0015), cytoplasm (n=63, p=0.00169), intracellular (n=79, p=0.00209), endoplasmic reticulum membrane (n=6, p=0.00289), integral to endoplasmic reticulum membrane (n=3, p=0.00328), nuclear envelope-endoplasmic reticulum network (n=6, p=0.00488).

C4 (66 genes)

Process: pantothenate biosynthesis (n=2, p=0.00246), pantothenate metabolism (n=2, p=0.00245), transport (n=16, p=0.00332), localization (n=16, p=0.00453).

Function: ubiquitin conjugating enzyme activity (n=2, p=0.00833), lipid transporter activity (n=2, p=0.00833).

Cellular Location: Golgi vesicle (n=2, p=0.00729).

Results on Yeast Cell Cycle: Specific Cluster

Cluster C3 (97 genes), Process: physiological process (n=76, p=0.0017), organelle organization and biogenesis (n=15, p=0.00173), localization (n=21, p=0.00537).

Different clusters show different shared terms.

The results are potentially biologically significant.

Summary

Contributions:

First algorithm to mine triclusters from 3D microarrays.

Complete and deterministic.

Allows small noise.

Flexible: constant, single/two dimension constant, scaling, shifting.

Allows arbitrary overlap (merge/prune).

Potentially biologically significant clusters (GO).

Future Work:

Extend from 3D to k-D datasets.

Allow different pattern types along different axes (scaling along GxS, shifting along T, etc.).

Enhance the clique mining step on multigraphs.

