Mining Order Preserving Submatrices (OPSMs) from data with replicates

Presenter: Chun-Kit Chui

Supervisor: Dr. Ben Kao

Presentation Outline

Conventional Order Preserving Submatrices

Multiple-value matrix data model

Mining OPSMs from the new data model

Efficient methods – bounding techniques

Experimental evaluation

Order Preserving Submatrix

Order Preserving Submatrix (OPSM) mining is a pattern-based subspace clustering problem that is commonly applied to gene expression datasets.

Gene expression dataset:

One of the goals of microarray data analysis is to group coexpressed genes into a cluster: genes that show similar changes of expression level (expression values going up or down) under some environmental stimuli (experimental conditions).

C1 C2 C3 C4 C5 C6 C7 C8

G1 36 32 12 19 18 42 33 8

G2 11 22 33 24 30 3 9 23

G3 14 18 48 28 38 11 33 21

G4 20 14 5 10 7 24 44 13

G5 38 25 10 24 19 39 8 22

Columns: experimental conditions (experimental settings)

Rows: genes

Each entry is the expression value of a gene under an experimental setting.


A gene’s expression level may vary substantially due to its sensitivity to experimental settings. That is, the change of expression level in response to a change of experimental condition is often considered more meaningful than the actual value.

In microarray experiments, the design of experimental conditions is often based on little knowledge of gene functions. Clustering algorithms should therefore consider a subset of conditions that maximizes the similarity among a subset of genes.

C1 C2 C3 C4 C5 C6 C7 C8

G1 36 32 12 19 18 42 33 8

G2 11 22 33 24 30 3 9 23

G3 14 18 48 28 38 11 33 21

G4 20 14 5 10 7 24 44 13

G5 38 25 10 24 19 39 8 22

Raw gene expression dataset

Data matrix plotted

No obvious patterns observed.


C3 C5 C4 C2 C1 C6

G1 12 18 19 32 36 42

G4 5 7 10 14 20 24

G5 10 19 24 25 38 39

Reordered subset of columns (experimental conditions)

Subset of genes

Consider a subset of experimental conditions and a subset of genes, and reorder the columns.

The expression values of these genes change in the same way in response to the change of experimental condition: they are all increasing.

Identifying this submatrix (a subset of genes, a subset of conditions, and an ordering of the conditions) is particularly useful for biologists, e.g. to infer gene regulatory networks.


Given a data matrix M with n rows and m columns, an order preserving submatrix S consists of:

A subset of rows R. E.g. R = {G1,G4,G5}

A subset of columns C. E.g. C = {C1,C2,C3,C4,C5,C6}

A column order constraint, such that the entries of all rows in R are increasingly ordered. E.g. <C3,C5,C4,C2,C1,C6>

Mining OPSMs: Find all OPSMs with |R| greater than or equal to a user-specified threshold (frequent) and |C| greater than or equal to c.

C3 C5 C4 C2 C1 C6

G1 12 18 19 32 36 42

G4 5 7 10 14 20 24

G5 10 19 24 25 38 39

Subset of genes

Reordered subset of columns (experimental conditions)

C1 C2 C3 C4 C5 C6 … Cm

G1 36 32 12 19 18 42 … 8

G2 11 22 33 24 30 3 … 23

G3 3 25 31 22 11 4 … 26

G4 20 14 5 10 7 24 … 13

G5 38 25 10 24 19 39 … 22

… … … … … … … … …

Gn …

Raw gene expression dataset

Mining OPSMs

Mining OPSMs can be reduced to a special case of sequential pattern mining. Transform the data matrix into a sequence dataset by sorting each row in ascending order and replacing the entries with the corresponding column labels.

Each sequential pattern uniquely specifies an OPSM, with all the supporting sequences as the supporting rows.

A row supports an OPSM if the order constraint of the OPSM is a subsequence of the transformed column sequence of the row.

C1 C2 C3 C4 C5 C6 C7 C8

G1 36 32 12 19 18 42 33 8

G2 11 22 33 24 30 3 9 23

G3 14 18 48 28 38 11 33 21

G4 20 14 5 10 7 24 44 13

G5 38 25 10 24 19 39 8 22

Raw data matrix

Gene Column Sequence

G1 <8,3,5,4,2,7,1,6>

G2 <6,7,1,2,8,4,5,3>

G3 <6,1,2,8,4,7,5,3>

G4 <3,5,4,8,2,1,6,7>

G5 <7,3,5,8,4,2,1,6>

Transformed sequence dataset

Sequential pattern (OPSM)

<3,5,4,2,1,6>

An OPSM
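The transformation and the support test described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names (`to_column_sequence`, `is_subsequence`, `supporting_rows`) are hypothetical and do not come from the presentation. The data is the 5 x 8 example matrix shown above.

```python
from typing import Dict, List, Sequence


def to_column_sequence(row: Sequence[float]) -> List[int]:
    """Sort one row in ascending order and return the 1-based column labels."""
    return [col + 1 for col in sorted(range(len(row)), key=lambda c: row[c])]


def is_subsequence(pattern: Sequence[int], sequence: Sequence[int]) -> bool:
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(label in it for label in pattern)


def supporting_rows(pattern: Sequence[int], sequences: Dict[str, List[int]]) -> List[str]:
    """Rows whose column sequence contains the order constraint as a subsequence."""
    return [gene for gene, seq in sequences.items() if is_subsequence(pattern, seq)]


matrix = {
    "G1": [36, 32, 12, 19, 18, 42, 33, 8],
    "G2": [11, 22, 33, 24, 30, 3, 9, 23],
    "G3": [14, 18, 48, 28, 38, 11, 33, 21],
    "G4": [20, 14, 5, 10, 7, 24, 44, 13],
    "G5": [38, 25, 10, 24, 19, 39, 8, 22],
}
sequences = {gene: to_column_sequence(row) for gene, row in matrix.items()}

print(sequences["G1"])                                  # [8, 3, 5, 4, 2, 7, 1, 6]
print(supporting_rows([3, 5, 4, 2, 1, 6], sequences))   # ['G1', 'G4', 'G5']
```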

Mining OPSMs – Two properties

Apriori property

If the OPSM with column order constraint <a,b> is infrequent (its number of supporting rows is smaller than a user-specified support threshold), then every OPSM whose column order constraint contains <a,b> as a subsequence (e.g. <a,b,c>, <a,b,c,d>, <c,a,b,d>) is also infrequent.

E.g. the OPSM with column order constraint <C8,C3,C5,C4,C2> is supported by G1 only; it is infrequent.

Adding more constraints to the OPSM can only reduce the number of supporting sequences.

OPSM <C8,C3,C5,C4,C2,C1> must therefore be infrequent.

According to the Apriori property, we can use an iterative method to prune the search space.

Gene Column Sequence

G1 <8,3,5,4,2,7,1,6>

G2 <6,7,1,2,8,4,5,3>

G3 <6,1,2,8,4,7,5,3>

G4 <3,5,4,8,2,1,6,7>

G5 <7,3,5,8,4,2,1,6>


That is, mine the frequent size-k OPSMs, identify the infrequent size-(k+1) OPSMs and prune them, then continue to the next iteration (k+1).

Mining OPSMs – Two properties

Transitivity

If OPSM O1 has column order constraint <x1,x2,…,xi,y1,y2,…,yj> with supporting rows R1, and OPSM O2 has column order constraint <y1,y2,…,yj,z1,z2,…,zk> with supporting rows R2, then the intersection of R1 and R2 yields the set of supporting rows for OPSM O3 with column order constraint <x1,x2,…,xi,y1,y2,…,yj,z1,z2,…,zk>.

E.g. OPSM <C3,C5,C7> is supported by G1 and G4, and OPSM <C5,C7,C1> is supported by G1.

Therefore OPSM <C3,C5,C7,C1> is supported by G1.

According to the Transitivity property, we can obtain the supports of size-(k+1) OPSMs from the frequent size-k OPSMs without rescanning the sequence dataset.

Gene Column Sequence

G1 <8,3,5,4,2,7,1,6>

G2 <6,7,1,2,8,4,5,3>

G3 <6,1,2,8,4,7,5,3>

G4 <3,5,4,8,2,1,6,7>

G5 <7,3,5,8,4,2,1,6>


Gene    <3,5,7>    <5,7,1>    <3,5,7,1>

G1    yes    yes    yes

G2    no    no    no

G3    no    no    no

G4    yes    no    no

G5    no    no    no
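The Transitivity example can be checked with a short sketch that intersects the two supporting row sets and compares the result with a direct scan. The helper names are hypothetical; the sequences are the transformed rows of the running example.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(label in it for label in pattern)


sequences = {
    "G1": [8, 3, 5, 4, 2, 7, 1, 6],
    "G2": [6, 7, 1, 2, 8, 4, 5, 3],
    "G3": [6, 1, 2, 8, 4, 7, 5, 3],
    "G4": [3, 5, 4, 8, 2, 1, 6, 7],
    "G5": [7, 3, 5, 8, 4, 2, 1, 6],
}


def support(pattern):
    """Supporting rows found by scanning the sequence dataset."""
    return {gene for gene, seq in sequences.items() if is_subsequence(pattern, seq)}


r1 = support([3, 5, 7])        # tail OPSM <C3,C5,C7>
r2 = support([5, 7, 1])        # head OPSM <C5,C7,C1>
merged = r1 & r2               # transitivity: no scan needed for <C3,C5,C7,C1>

print(sorted(r1), sorted(r2), sorted(merged))  # ['G1', 'G4'] ['G1'] ['G1']
```

The intersection agrees with a direct scan for `[3, 5, 7, 1]`, which is what allows the conventional algorithm to skip the rescan.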

The Subsequence Function verifies the supporting rows of the candidate OPSMs.

Mining OPSMs

OPSM-Gen

Subsequence Function

Size-2 OPSM candidates

Frequent size-k OPSMs

Size-(k+1) candidate OPSMs

According to the Apriori property, we only generate those size-(k+1) candidates whose proper subsets are all frequent.

According to the Transitivity property, we can obtain the supporting rows of the size-(k+1) candidates from the frequent size-k OPSMs. There is no need to scan the dataset.

We start from mining size-2 OPSMs because size-1 OPSMs do not have any ordering.

Those OPSMs whose number of supporting rows is greater than or equal to the support threshold are frequent; they are passed into the OPSM-Gen procedure.

The algorithm terminates when no more candidates are generated.
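The loop between the Subsequence Function and OPSM-Gen can be sketched as a level-wise search. This is a simplified illustration, not the presenter's implementation: `mine_opsms` is a hypothetical name, candidates are generated by a plain nested loop rather than a Head-tail tree, and only the two generating sub-patterns (head and tail) are checked for frequency.

```python
from itertools import permutations


def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(label in it for label in pattern)


def mine_opsms(sequences, min_support):
    """Return {column order constraint: supporting rows} for all frequent OPSMs."""
    columns = sorted({c for seq in sequences.values() for c in seq})
    # Size-2 candidates are verified against the dataset (Subsequence Function).
    frequent = {}
    for cand in permutations(columns, 2):
        rows = frozenset(g for g, s in sequences.items() if is_subsequence(cand, s))
        if len(rows) >= min_support:
            frequent[cand] = rows
    result = dict(frequent)
    while frequent:  # terminates when no more candidates survive
        next_level = {}
        for left, rows_left in frequent.items():
            for right, rows_right in frequent.items():
                # OPSM-Gen: the tail of `left` must equal the head of `right`.
                if left[1:] != right[:-1] or right[-1] in left:
                    continue
                rows = rows_left & rows_right  # transitivity, no dataset scan
                if len(rows) >= min_support:
                    next_level[left + (right[-1],)] = rows
        result.update(next_level)
        frequent = next_level
    return result


sequences = {
    "G1": [8, 3, 5, 4, 2, 7, 1, 6],
    "G2": [6, 7, 1, 2, 8, 4, 5, 3],
    "G3": [6, 1, 2, 8, 4, 7, 5, 3],
    "G4": [3, 5, 4, 8, 2, 1, 6, 7],
    "G5": [7, 3, 5, 8, 4, 2, 1, 6],
}
found = mine_opsms(sequences, min_support=3)
print(sorted(found[(3, 5, 4, 2, 1, 6)]))  # ['G1', 'G4', 'G5']
```

With a support threshold of 3 rows (an assumed value for this sketch), the search recovers the OPSM <C3,C5,C4,C2,C1,C6> from the earlier slides.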

Mining OPSMs – Data Structure

Size-3 frequent OPSMs table

Column order constraint    Supporting rows

<a,b,c>    1,3,5,6

…    …

[Figure: Head-tail tree built over the table. Following the head <a,b> from the root reaches a leaf that stores the entry H <a,b,c>; following the tail <b,c> reaches a leaf that stores the entry T <a,b,c>. Each leaf entry keeps a pointer (ptr) to the OPSM in the table.]

In the OPSM-Gen procedure, the size-k frequent OPSMs table stores the column order constraints (OPSMs) and the supporting rows (bit vectors / tid-lists) of the OPSMs.

A Head-tail tree data structure was proposed to facilitate candidate generation in the OPSM-Gen procedure.

To add an OPSM <a,b,c> into the Head-tail tree, we first follow the “head” of the OPSM (i.e. <a,b>) to traverse the tree and store the OPSM in the leaf node reached, marked with H to indicate that the leaf was reached by following the head.

Then we follow the “tail” of the OPSM (i.e. <b,c>) to traverse the tree and store the OPSM in the leaf node reached, marked with T to indicate that the leaf was reached by following the tail.



Column order constraint    Supporting rows

<a,b,c>    1,3,5,6

<a,b,d>    4,5,6

<c,a,b>    1,2,3,4,5

<e,a,b>    2,5,6,7

…

Leaf node reached by the path <a,b>

Head/Tail    OPSM ptr

H    <a,b,c>

H    <a,b,d>

T    <c,a,b>

T    <e,a,b>

According to the Transitivity property, to generate the size-4 candidates we can simply merge the tail OPSMs with the head OPSMs within each leaf node.

For example, Tail <c,a,b> and Head <a,b,d> can be merged to form a new size-4 OPSM <c,a,b,d>.

Following their ptrs, we can find their supporting rows. By the Transitivity property, OPSM <c,a,b,d> is supported by rows 4 and 5 (the intersection of the two supporting row vectors).
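The merge step inside a leaf can be sketched as follows. As an assumption for brevity, a dictionary keyed by the shared (k-1)-column path stands in for the Head-tail tree; `build_leaves` and `generate_candidates` are hypothetical names. The table is the size-3 example from the slide.

```python
from collections import defaultdict

frequent = {
    ("a", "b", "c"): {1, 3, 5, 6},
    ("a", "b", "d"): {4, 5, 6},
    ("c", "a", "b"): {1, 2, 3, 4, 5},
    ("e", "a", "b"): {2, 5, 6, 7},
}


def build_leaves(frequent):
    """Group OPSMs by head (first k-1 columns) and by tail (last k-1 columns)."""
    leaves = defaultdict(lambda: {"H": [], "T": []})
    for opsm in frequent:
        leaves[opsm[:-1]]["H"].append(opsm)   # reached by following the head
        leaves[opsm[1:]]["T"].append(opsm)    # reached by following the tail
    return leaves


def generate_candidates(frequent):
    """Merge every tail OPSM with every head OPSM stored in the same leaf."""
    candidates = {}
    for leaf in build_leaves(frequent).values():
        for tail_opsm in leaf["T"]:
            for head_opsm in leaf["H"]:
                if head_opsm[-1] in tail_opsm:
                    continue  # a column may appear only once in an OPSM
                merged = tail_opsm + head_opsm[-1:]
                candidates[merged] = frequent[tail_opsm] & frequent[head_opsm]
    return candidates


candidates = generate_candidates(frequent)
print(sorted(candidates[("c", "a", "b", "d")]))  # [4, 5]
```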



The replicated data model

Multiple-value matrix

Multiple-value matrix data model

Recent research in microarray data analysis has shown that any single microarray output is subject to substantial variability.

[Stefan Bleuler et al, Evo Workshops 2005; R. Coombes et al, Journal of Computational Biology 2002; G. C. Tseng et al, Nucleic Acids Research. 2001; J. P. Brody et al., National Academy of Sciences.,2002]

The error of the expression values of the genes under an experimental condition can be large.

Replication is strongly supported by biologists as a straightforward approach for improving the quality of inferences made from experimental studies.

[T.-K. Jenssen et al, Nucleic Acids Research 2002 ; M.-L. T. Lee et al, PNAS 2000 ; J. Novak et al, Genomics,2002; R. Ramakrishnan et al, Nucleic Acids Research 2002]

Technical variability: measurement error of the experimental system (comparatively small).

Biological variability: natural heterogeneity among individuals (can be large).

Technical replication: repeated measurement of the same sample.

Biological replication: the practice of measuring multiple samples under the same experimental condition.


The practice of conducting repeated experiments is stressed in much of the literature on microarray studies. A study on the effect of repeated measures on the detection of differentially expressed genes has reported that stable results are typically not obtained until at least five biological replicates have been used. [P. Pavlidis et al, Bioinformatics 2003]

Another study on variability analysis of gene expression data suggests that at least three repeated experiments should be conducted instead of one. [M.-L. T. Lee et al, PNAS 2000]

Therefore, it is necessary to consider the data output by the repeated experiments when analyzing gene expression data.


With repeated experiments, the data output by microarray experiments can be organized as a matrix in which each entry is a set of expression levels of a gene under an experimental condition.

8 Conditions

3 Genes

There are 3 repeated experiments conducted under experimental condition (column) C1; the expression values of gene (row) G1 in the first, second and third repeated experiments (replicates) are 23, 24 and 22 respectively.


Which OPSM does G2 support?

Expression values: <25, 26, 27, 31, 36, 37, 40, 45>

There are two enumerated column orderings deduced from G2 which conform to the column order constraint of this OPSM.

Let’s consider this set of replicates.

Expression values: <25, 26, 27, 31, 36, 37, 41, 45>

Since the expression values of G2 in columns <C6,C4,C1,C3,C5,C7,C8,C2> are increasingly ordered, we say that the OPSM with column order constraint <C6,C4,C1,C3,C5,C7,C8,C2> is one of the OPSMs that is possibly supported by G2.


Which OPSM does G2 support?

Expression values: <22, 27, 30, 31, 33, 36, 43, 45>

There are six enumerated column orderings deduced from G2 which conform to the column order constraint of this OPSM.

Which OPSM does G2 conform to more?

We define the score given by a row (gene) to an OPSM as the fraction of all the enumerated column orderings of that row which conform to the column order constraint of the OPSM.

The problem of obtaining the count of the enumerated column orderings that conform to <C1,C2> is equivalent to obtaining the number of subsequence matches of <C1,C2> in the transformed sequence dataset.

Scoring Model

Raw Dataset

[Figure: raw dataset for row G1 with two columns, C1 and C2, each with 4 replicates.]

Which OPSM, <C1,C2> or <C2,C1>, does G1 support?

Enumerated column orderings table

From the raw dataset, we can enumerate all the possible column orderings and store them in the Enumerated column orderings table.

In this case, there are 16 enumerated column orderings in total.

<C1,C2> accounts for 11 of the 16 enumerated column orderings, so the OPSM with column order constraint <C1,C2> scores 11/16 from G1.

Enumerated column orderings counts

    <C1,C2>    <C2,C1>

Counts    11    5

Score    11/16    5/16

Transformed Sequence Dataset

Similar to conventional OPSM mining, we can transform the raw dataset into a sequence dataset.

Row    Column sequence

Row 1    <1,1,2,2,1,2,1,2>

Subsequence Matches

    <C1,C2>    <C2,C1>

#Subsequence matches    11    5

Score    11/16    5/16

The denominator of the score function can be calculated by multiplying the numbers of replicates of the columns involved, i.e. 4*4=16.
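The counts and scores in this example can be reproduced with a short sketch. `count_matches` and `score` are hypothetical names; the counting uses a standard dynamic program over the sequence (a swapped-in technique for illustration, not one stated on the slides) and assumes the columns in a pattern are distinct.

```python
from fractions import Fraction
from math import prod


def count_matches(pattern, sequence):
    """Number of ways `pattern` occurs in `sequence` as a subsequence."""
    # ways[j] = number of matches of the first j pattern columns seen so far
    ways = [1] + [0] * len(pattern)
    for label in sequence:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == label:
                ways[j] += ways[j - 1]
    return ways[-1]


def score(pattern, sequence):
    """Fraction of enumerated column orderings that conform to `pattern`."""
    denominator = prod(sequence.count(col) for col in pattern)  # e.g. 4 * 4
    return Fraction(count_matches(pattern, sequence), denominator)


row1 = [1, 1, 2, 2, 1, 2, 1, 2]
print(count_matches([1, 2], row1), count_matches([2, 1], row1))  # 11 5
print(score([1, 2], row1), score([2, 1], row1))                  # 11/16 5/16
```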

Scoring Model

Column order constraint    Supporting rows (#subsequence matches)    Total score

<a,b,c>    1(14), 3(11), 5(22), 6(63)    (14+11+22+63) / 64

<a,b,d>    4(44), 5(36), 6(25)    (44+36+25) / 64

<c,a,b>    1(52), 2(42), 3(14), 4(20)    (52+42+14+20) / 64

<e,a,b>    2(42), 5(31), 6(36), 7(13)    (42+31+36+13) / 64

…    …    …

Leaf node of the Head-tail tree

Head/Tail    OPSM ptr

H    <a,b,c>

H    <a,b,d>

T    <c,a,b>

T    <e,a,b>

To mine the OPSMs under the scoring model, each supporting row (gene) is associated with the #subsequence matches of the column order constraint (OPSM).

From the #subsequence matches, we can calculate the score contributed by each row (gene) and the total of the scores obtained for the OPSM.

Here, we use the total score as the support measure of the OPSMs. Those OPSMs with total scores over a user-specified threshold are regarded as frequent.
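As a small worked example, the total score of <a,b,c> can be computed from the per-row match counts. The sketch assumes that the denominator 64 in the table is 4*4*4 (three columns with four replicates each), which is consistent with the earlier rule for the denominator but is not stated explicitly on this slide. The name `total_score` and the threshold value are hypothetical.

```python
from fractions import Fraction


def total_score(matches_per_row, replicates_per_column):
    """Sum of the per-row scores of one OPSM."""
    denominator = 1
    for replicates in replicates_per_column:
        denominator *= replicates
    return sum(Fraction(m, denominator) for m in matches_per_row.values())


abc = {1: 14, 3: 11, 5: 22, 6: 63}    # row id -> #subsequence matches of <a,b,c>
abc_score = total_score(abc, [4, 4, 4])
print(abc_score, float(abc_score))     # 55/32 1.71875

threshold = 1.5                        # hypothetical user-specified threshold
print(abc_score > threshold)           # True
```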

Mining OPSMs from multiple-value matrix

Raw Dataset

An example raw dataset with 3 conditions (columns), where each condition has 4 repeated experiments (replicates).

Transformed sequence dataset

We transform the raw dataset into a sequence dataset by sorting the entries in ascending order and replacing the entries with their condition (column) IDs.

Row    Column sequence

Row 1    <1,1,2,3,2,3,1,2,1,3,2,3>

Subsequence Matches

    <C1,C2>    <C2,C3>

#Subsequence matches    11    10

Score    11/16    10/16

From the row 1 sequence, we find that <C1,C2> has 11 subsequence matches in row 1. Similarly, <C2,C3> has 10 subsequence matches in row 1.

After obtaining the total scores of the two OPSMs, we find that they are frequent. Therefore, they are stored in the size-2 frequent OPSM table.


Size-2 frequent OPSMs table

Column order constraint    Supporting rows (#subsequence matches)

<C1,C2>    1(11), …

<C2,C3>    1(10), …

…    …

Head-Tail Tree

In the OPSM-Gen procedure, OPSMs are organized in a Head-Tail tree data structure.

[Figure: Head-Tail tree with a root and child nodes 1, 2 and 3. The leaf for column 2 stores T <C1,C2> and H <C2,C3>, each with a ptr to its entry in the table.]

According to the Transitivity property, Tail <C1,C2> and Head <C2,C3> can be merged to form a size-3 OPSM.

Question: Can we deduce the #subsequence matches (score) of <C1,C2,C3> in Row 1 from the size-2 frequent OPSMs table?

Recall that in conventional OPSM mining, we can deduce the support of <C1,C2,C3> from the size-2 frequent OPSM table by intersecting the supporting rows, so that we do not need to rescan the dataset.


Enumerated ordering tables (size-2) for row 1

[Figure: the enumerated orderings of <C1,C2> and of <C2,C3> for row 1, one table each.]

Essentially, obtaining the #subsequence matches of <C1,C2,C3> is equivalent to performing a join on the column C2 of the two tables.

Since the joining information cannot be deduced from the counts (#subsequence matches), we cannot obtain the #subsequence matches of <C1,C2,C3> without revisiting the sequence dataset.

Question: Can we materialize these tables to facilitate the joining?
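To make the join concrete, the sketch below materializes the two enumerated ordering tables for row 1 as lists of position pairs and joins them on the shared C2 position. The representation (positions in the column sequence) and the name `enumerate_pairs` are assumptions made for illustration.

```python
row1 = [1, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3]


def enumerate_pairs(first, second, sequence):
    """All position pairs (i, j), i < j, with sequence[i] == first and sequence[j] == second."""
    return [
        (i, j)
        for i, label in enumerate(sequence)
        if label == first
        for j in range(i + 1, len(sequence))
        if sequence[j] == second
    ]


c1_c2 = enumerate_pairs(1, 2, row1)   # enumerated orderings of <C1,C2>
c2_c3 = enumerate_pairs(2, 3, row1)   # enumerated orderings of <C2,C3>

# Join on the position of C2: the same C2 replicate must be used in both pairs.
c1_c2_c3 = [(i, j, k) for (i, j) in c1_c2 for (j2, k) in c2_c3 if j == j2]

print(len(c1_c2), len(c2_c3), len(c1_c2_c3))  # 11 10 24
```

The counts 11 and 10 alone do not determine 24; the join needs to know which C2 replicate each pair uses.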


OPSM-Gen

Subsequence Function

Size-2 OPSM candidates

Frequent size-k OPSMs

Size-(k+1) candidate OPSMs

Unlike conventional OPSM mining, we have to revisit the dataset to obtain the #subsequence matches for the candidates.

Combinatorial explosion of the number of candidates.

Obtaining the #subsequence matches of a candidate requires enumeration of the column orderings, which is exponential in the size of the candidate OPSMs. The same process has to be repeated for all rows.

Reduce the #candidates through some bounding techniques.

Organize the candidates in a prefix tree and verify the #subsequence matches in a single scan over the dataset.

Compress the sequence dataset to reduce the effort for obtaining the #subsequence matches (tree traversal).

Min upper bound

Motivating questions

Assume the #replicates of all columns is 4.

We have 11 subsequences <C1,C2> in row 1, and there are 4 “C3”s, so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 44. Here we assume that all 4 “C3”s are on the right of the 11 subsequences <C1,C2>, so we guess that the maximum possible #subsequence matches of <C1,C2,C3> is 11*4 = 44.

We have 10 subsequences <C2,C3> in row 1, and there are 4 “C1”s, so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 40.

Therefore, the upper bound of the possible #subsequence matches of <C1,C2,C3> in row 1 is 40.

This is an upper bound of the #subsequence matches of <C1,C2,C3> in row 1. If we apply this bound to all the rows, we can obtain an upper bound of the score of an OPSM.
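The arithmetic of the Min upper bound is small enough to state as a function. The name `min_upper_bound` and its parameter names are hypothetical.

```python
def min_upper_bound(tail_matches, head_matches, first_replicates, last_replicates):
    """Upper bound on the matches of the merged OPSM in one row.

    tail_matches:     #subsequence matches of the tail, e.g. <C1,C2>
    head_matches:     #subsequence matches of the head, e.g. <C2,C3>
    first_replicates: #replicates of the first column of the merged OPSM (C1)
    last_replicates:  #replicates of the last column of the merged OPSM (C3)
    """
    extend_tail = tail_matches * last_replicates    # every C3 after every <C1,C2>
    extend_head = head_matches * first_replicates   # every C1 before every <C2,C3>
    return min(extend_tail, extend_head)


print(min_upper_bound(11, 10, 4, 4))  # 40
```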

HT arrays

Row    Column sequence

Row 1    <1, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3>

Construct a T-array for the tail OPSM <C1,C2>. Slot q of the T-array answers the question: how many “C2”s are there after the qth “C1”?

T-array of <C1,C2>

Slot    1 2 3 4

Value    4 4 2 1

Similarly, slot q of the H-array for the head OPSM <C2,C3> answers the question: how many “C3”s are there after the qth “C2”?

H-array of <C2,C3>

Slot    1 2 3 4

Value    4 3 2 1

With these two arrays, we can deduce the #subsequence matches of <C1,C2,C3>.

There are 4 “C2”s after the 1st “C1”. There are 4 “C3”s after the 1st “C2”, so there are 4 <C1,C2,C3> orderings formed by the 1st C1 and the 1st C2. There are 3 “C3”s after the 2nd “C2”, so there are 3 <C1,C2,C3> orderings formed by the 1st C1 and the 2nd C2. There are 2 “C3”s after the 3rd “C2”, and there is 1 “C3” after the 4th “C2”. The 1st “C1” therefore contributes 4 + 3 + 2 + 1 = 10.

Similarly for the 2nd “C1”: there are 4 “C2”s after the 2nd “C1”, so we can sum all 4 entries of the H-array, which contributes another 10.

There are 2 “C2”s after the 3rd “C1”. Which slots of the H-array should we sum up? Since there are only 2 “C2”s after the 3rd “C1”, they must be the 3rd and 4th “C2”s; otherwise T-array[3] would not be 2. This contributes 2 + 1.

Finally, there is 1 “C2” after the 4th “C1”. It must be the 4th “C2”; otherwise T-array[4] would not be 1. This contributes 1.

#Subsequence matches of <C1,C2,C3> = (4 + 3 + 2 + 1) + 10 + (2 + 1) + 1 = 24

We can deduce that the #subsequence matches of <C1,C2,C3> from row 1 is 24.
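The walkthrough above can be sketched in code for the size-3 case, where the middle sequence is the single column C2. `suffix_counts` and `combine` are hypothetical names; the combination rule (the qth T-array value t selects the last t slots of the H-array) follows the reasoning on the slides.

```python
def suffix_counts(sequence, anchor, target):
    """Slot q holds the number of `target` labels after the q-th `anchor` label."""
    counts, seen = [], 0
    for label in reversed(sequence):
        if label == target:
            seen += 1
        elif label == anchor:
            counts.append(seen)
    return counts[::-1]


def combine(t_array, h_array):
    """Matches of the merged OPSM: each T value t sums the last t H-array slots."""
    n = len(h_array)
    return sum(sum(h_array[n - t:]) for t in t_array)


row1 = [1, 1, 2, 3, 2, 3, 1, 2, 1, 3, 2, 3]
t_array = suffix_counts(row1, anchor=1, target=2)   # tail <C1,C2>
h_array = suffix_counts(row1, anchor=2, target=3)   # head <C2,C3>

print(t_array, h_array)            # [4, 4, 2, 1] [4, 3, 2, 1]
print(combine(t_array, h_array))   # 24
```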

Can we store the HT-arrays instead of the #subsequence matches, so that we do not need to rescan the dataset in the Subsequence Function procedure?

HT arrays

T-array of <C1,C2>: 4 4 2 1

H-array of <C2,C3>: 4 3 2 1

Generalized HT-arrays:

T-array of <C1,C2, … ,Cx-1>, slots 1 2 3 4

H-array of <C2, … ,Cx-1, Cx>, slots 1 2 3 …

However, the number of slots of the H-array is exponential in the number of columns in the OPSMs.

HT upper bound

Motivation: We can obtain a bound on the #subsequence matches of <C1,C2,C3> by guessing the HT-arrays from the #subsequence matches of the tail and head OPSMs.

Size-2 frequent OPSMs table: <C1,C2> 1(11), … and <C2,C3> 1(10), …

To obtain an upper bound of the #subsequence matches of <C1,C2,C3>, we try to guess the T-array such that the #subsequence matches of <C1,C2,C3> is maximum. This can be done by assigning the “C1”s to the left in the column sequence as much as possible (push left).

The first slot indicates how many “C2”s there are after the 1st “C1”; therefore its value cannot be larger than the #replicates of C2.

T-array of <C1,C2> (push left)

Slot    1 2 3 4

Value    4 4 3 0

For the H-array, we assign the “C3”s to the right in the column sequence as much as possible (push right).

The H-array cannot be 0 2 4 4. If there is no C3 after the 1st C2, then there will not be any C3 after the 2nd, 3rd and 4th C2. Therefore, there is a constraint when assigning values to the HT-arrays: Array[x] >= Array[x+1].

H-array of <C2,C3> (push right)

Slot    1 2 3 4

Value    3 3 2 2

Following the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays.

Upper bound of the #subsequence matches of <C1,C2,C3> = (3 + 3 + 2 + 2) + (3 + 3 + 2 + 2) + (3 + 2 + 2) = 27

The upper bound of the #subsequence matches of <C1,C2,C3> from row 1 is 27.

Upper bound: 27
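The upper-bound arithmetic can be checked by feeding the two guessed arrays from the slide into the same combination rule used for the exact HT-arrays. The arrays are copied from the slide as literals; `is_valid` and `combine` are hypothetical names.

```python
def is_valid(array):
    """Constraint from the slides: Array[x] >= Array[x+1]."""
    return all(a >= b for a, b in zip(array, array[1:]))


def combine(t_array, h_array):
    """Each T value t sums the last t slots of the H-array."""
    n = len(h_array)
    return sum(sum(h_array[n - t:]) for t in t_array)


t_guess = [4, 4, 3, 0]   # push left, total 11 = matches of <C1,C2>
h_guess = [3, 3, 2, 2]   # push right, total 10 = matches of <C2,C3>

print(is_valid(t_guess), is_valid(h_guess), is_valid([0, 2, 4, 4]))  # True True False
print(sum(t_guess), sum(h_guess))                                     # 11 10
print(combine(t_guess, h_guess))                                      # 27
```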


HT bounds

Upper bound: 27

T-array of <C1,C2> (push left): 4 4 3 0

H-array of <C2,C3> (push right): 3 3 2 2

Lower bound

Similarly, we can obtain a lower bound of the #subsequence matches of <C1,C2,C3> by assigning C1 to the right of C2 as much as possible, and C3 to the left of C2 as much as possible.

T-array of <C1,C2>: 3 3 3 2

H-array of <C2,C3>: 4 4 2 0

Each of the first three slots of the T-array is 3, so each sums the last three slots of the H-array: 4 + 2 + 0 = 6. The last slot of the T-array is 2, which sums the last two slots: 2 + 0 = 2.

Lower bound of the #subsequence matches of <C1,C2,C3> = 6 + 6 + 6 + 2 = 20

The lower bound of the #subsequence matches of <C1,C2,C3> is 20.

Lower bound: 20

Comparisons

Min upper bound: 40

HT upper bound: 27 (T-array 4 4 3 0, H-array 3 3 2 2)

HT array (exact): 24 (T-array 4 4 2 1, H-array 4 3 2 1)

HT lower bound: 20 (T-array 3 3 3 2, H-array 4 4 2 0)

Recall that the HT-array method can return the exact #subsequence matches of <C1,C2,C3>, which is 24. However, it is not feasible to keep the HT-arrays for each candidate.

The Min upper bound approach returns 40 as the upper bound of the #subsequence matches of <C1,C2,C3>. Compared with the Min upper bound, the HT bound is much tighter.

Generalized HT upper bound

Tail sequence

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

Push left

4 4 3 0

Push right

3 3 2 2

Upper bound: 27Tail = <C1,C2, … ,Cx-1>

Head = <C2, … ,Cx-1, Cx>Head sequence

Generated sequence New = <C1,C2, … ,Cx-1, Cx>

Assume the number of replicate for column Cy is r(Cy) .

Middle = <C2, … ,Cx-1>Middle sequence


…T-array

1 2 3 …

<C1,C2, … ,Cx-1>

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

Push left

4 4 3 0

Push right

3 3 2 2


Head = <C2, … ,Cx-1, Cx>

New = <C1,C2, … ,Cx-1, Cx>

The qth -slot of the T-array represents the number of “Middle sequence”s after the qth C1.

Middle = <C2, … ,Cx-1>

Therefore, the #slots for T-array is equal to the #replicates of C1.

The maximum possible value for each slot is equal to the maximum possible #subsequence matches for the middle sequence.

Maximum possible value

Tail sequence

Head sequence

Generated sequence


Middle sequence

Therefore the #slots for H-array is equal to the maximum possible #subsequence matches of the middle sequence.


…T-array

1 2 3 …

<C1,C2, … ,Cx-1>

T-array1 2 3 4

<C1,C2>

<C2,C3>H-array1 2 3 4

Push left

4 4 3 0

Push right

3 3 2 2


Head = <C2, … ,Cx-1, Cx>

New = <C1,C2, … ,Cx-1, Cx>



… … …H-array1 2 3 ………………

<C2, … ,Cx-1, Cx>

The qth -slot of the H-array represents the number of “Cx“s after the qth “Middle sequence”.

The maximum possible value for each slot in H-array is equal to the #replicates of Cx.


We notice that the “push left” assignment always yields a T-array in which the first k slots are fully filled, and all the slots after the (k+1)-th slot are zeros.

Let T be the #subsequence matches of the Tail sequence (i.e. 11 in the example), and let M be the maximum possible value of a slot (i.e. 4 in the example).

Rule 1: The first ⌊T / M⌋ slots have value M.

e.g. The first ⌊11 / 4⌋ = 2 slots have value 4.

Rule 2: The (⌊T / M⌋ + 1)-th slot has value T mod M.

e.g. The 3rd slot has value 11 mod 4 = 3.

Rule 3: The other slots have value zero.


Let H be the #subsequence matches of the Head sequence (i.e. 10 in the example), and let N be the #slots of the H-array (i.e. 4 in the example).

Similar to the T-array, the “push right” H-array can be divided into two partitions; the values in the first partition are larger than the values in the second partition by 1.

Rule 1: The first (H mod N) slots have value ⌊H / N⌋ + 1.

e.g. The first 10 mod 4 = 2 slots have value 3.

Rule 2: The other slots have value ⌊H / N⌋.

e.g. The other 2 slots have value ⌊10 / 4⌋ = 2.


With these rules, we can derive a formula that calculates the upper bound without constructing the arrays.

A similar method can be applied to the HT lower bound; therefore, we do not need to materialize any of the HT-arrays.
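The push-left and push-right rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the combination step in `ht_upper_bound` (the q-th C1 is followed by the last t slots of the H-array) is an assumption chosen because it reproduces the value 27 of the running example; the slides do not show the closed-form formula.

```python
def push_left(total, n_slots, max_value):
    """T-array: fill slots from the left, each up to max_value.

    Assumes total <= n_slots * max_value.
    """
    full, rest = divmod(total, max_value)
    slots = [max_value] * full               # Rule 1: fully filled slots
    if rest:
        slots.append(rest)                   # Rule 2: one partially filled slot
    slots += [0] * (n_slots - len(slots))    # Rule 3: remaining slots are zero
    return slots


def push_right(total, n_slots):
    """H-array: spread the total so that slot values differ by at most 1."""
    base, extra = divmod(total, n_slots)
    return [base + 1] * extra + [base] * (n_slots - extra)


def ht_upper_bound(t_array, h_array):
    """Assumed combination step: a C1 followed by t Middle sequences
    is matched with the last t slots of the H-array."""
    n = len(h_array)
    return sum(sum(h_array[n - t:]) for t in t_array)


t_array = push_left(11, 4, 4)     # T = 11, 4 slots, at most 4 per slot
h_array = push_right(10, 4)       # H = 10, 4 slots
print(t_array, h_array, ht_upper_bound(t_array, h_array))
# [4, 4, 3, 0] [3, 3, 2, 2] 27
```

The arrays are built here only to make the rules visible; since both arrays are fully determined by T, H, and the replicate counts, the same sum could be computed arithmetically without materializing them.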

Compression method

Transformed Sequence Dataset

Row      Column sequence
Row 1    <1, 1, 1, 2, 1, 2, 2, 2>

Given a column sequence, we would like to find the #subsequence matches of <C1,C2> in the column sequence of row 1.

The naive method enumerates all the size-2 subsequences and counts the occurrences of <C1,C2>, which requires enumerating 16 column orderings.

Compressed Sequence Dataset

Row      Column sequence
Row 1    <1(3), 2, 1, 2(3)>

#subsequence matches of <C1,C2> in row 1: 3*1 + 3*3 + 1*3 = 15

There are 3 “C1”s on the left of 1 “C2”, therefore there are 3*1 = 3 <C1,C2>s.

There are 3 “C1”s on the left of 3 “C2”s, therefore there are 3*3 = 9 <C1,C2>s.

There is 1 “C1” on the left of 3 “C2”s, therefore there are 1*3 = 3 <C1,C2>s.

There are 15 <C1,C2>s in total. Obtaining the #subsequence matches this way only requires enumerating 3 column orderings.
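The compression idea for row 1 can be sketched as follows. The function names are hypothetical, and the counting uses a single left-to-right pass with a running count of “C1”s, which is a small variation of the pairwise run enumeration on the slide (4*3 is the same as 3*3 + 1*3). It assumes the two columns of the pattern are different.

```python
from itertools import groupby


def compress(sequence):
    """Run-length encode a column sequence into (column, count) pairs."""
    return [(col, len(list(run))) for col, run in groupby(sequence)]


def count_pairs(runs, first, second):
    """Count subsequence matches of <first, second> over the runs."""
    seen_first = 0    # number of `first` elements seen so far
    matches = 0
    for col, count in runs:
        if col == first:
            seen_first += count
        elif col == second:
            # every earlier `first` pairs with every element of this run
            matches += seen_first * count
    return matches


row1 = [1, 1, 1, 2, 1, 2, 2, 2]
runs = compress(row1)
print(runs)                      # [(1, 3), (2, 1), (1, 1), (2, 3)]
print(count_pairs(runs, 1, 2))   # 15
```

The work is proportional to the number of runs rather than the number of replicate values, which is where the saving comes from when replicates of the same column sit next to each other.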

Experimental Evaluation

Experimental settings

C programming language

Machine: CPU 2.6 GHz, Memory 1 GB, Fedora

Dataset

Real dataset: Yeast galactose dataset
  Subset of 205 genes (rows) of the yeast galactose data
  20 experimental conditions (columns)
  4 biological replicates per condition
  Publicly available: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html

Synthetic dataset
  Replicate simulation: Generate normal distributions according to the means and variances of the replicates in the real dataset, and randomly generate a new replicate value according to the distribution.
  Column simulation: Generate a new column by randomly selecting an experimental condition in the real dataset and perturbing its mean and variance.
  Row simulation: Generate normal distributions according to the means and variances of the replicates in the real dataset, and generate a new row according to the distributions.
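The replicate simulation step can be illustrated with a short sketch. Everything here is an assumption for illustration: the function name is hypothetical, the replicate values are made up rather than taken from the yeast dataset, and the sample standard deviation is used because the slides do not say which variance estimator was applied.

```python
import random
import statistics


def simulate_replicate(replicates, rng):
    """Draw one new replicate value from a normal distribution fitted
    to the existing replicates of a single (gene, condition) cell."""
    mean = statistics.mean(replicates)
    stdev = statistics.stdev(replicates)   # sample standard deviation (assumption)
    return rng.gauss(mean, stdev)


rng = random.Random(0)               # fixed seed so runs are repeatable
cell = [36.0, 34.5, 37.2, 35.1]      # made-up replicate values for one cell
extended = cell + [simulate_replicate(cell, rng)]
print(len(extended))                 # 5
```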

Execution time per iteration

The Brute-force approach mines the OPSMs without using any bounding technique. All the algorithms start from mining size-2 OPSMs.

For the HT bounds, we use the HT upper bound to identify infrequent candidates, which can be pruned, and the HT lower bound to identify large OPSMs. We do not verify the #subsequence matches of those large OPSMs.
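How the two bounds are used can be summarized as a three-way decision. This is a minimal sketch, assuming the bounds have already been turned into lower and upper bounds on a candidate's support; the function name and the string labels are hypothetical.

```python
def classify_candidate(lower_bound, upper_bound, threshold):
    """Decide what to do with a candidate from its support bounds."""
    if upper_bound < threshold:
        return "prune"     # cannot reach the support requirement
    if lower_bound >= threshold:
        return "accept"    # large OPSM, no exact counting needed
    return "verify"        # bounds are inconclusive, count exactly


print(classify_candidate(5, 8, 10))     # prune
print(classify_candidate(12, 20, 10))   # accept
print(classify_candidate(6, 15, 10))    # verify
```

Only the candidates that fall into the last case pay the cost of obtaining the exact #subsequence matches.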

The number of candidates generated in each iteration using different bounding techniques

The HT upper bound technique reduces the #candidates by more than half in all iterations.

Execution time per iteration

The HT bounds + compression approach uses the HT upper and lower bounds to reduce the candidate set, and uses the compression method to reduce the cost of obtaining the #subsequence matches of the candidates.

The number of candidates generated in each iteration using different bounding techniques

Execution time in each iteration using different bounding techniques


Vary the support threshold

The saving from the HT upper bound decreases as the support threshold decreases, because it becomes harder for an upper bound to fall below the support requirement (the pruning condition) as the requirement decreases.

Scalability test on support threshold

The saving from the lower bound increases as the support threshold decreases. As the support requirement decreases, the gap between the supports of large candidates and the support requirement widens, so those large OPSMs become easier to identify.

Execution time saving (%) compared with the Brute force approach

The HT bounds + compression method achieves the best execution time saving.

Vary the #columns

Scalability test on #columns

Essentially, an increase in #columns increases the number of candidates generated but NOT the cost of obtaining the #subsequence matches of each candidate.

The pruning power of the bounding techniques is largely independent of the number of columns in the dataset.


Vary the #replicates

Scalability test on #replicates


The savings from both the Min upper bound and the HT upper bound decrease as #replicates increases. Why?



Execution time saving (%) compared with the Brute force approach

HT upper bound

The number of slots of the T- and H-arrays is determined by the #replicates; essentially, the larger the arrays, the looser the bounds.

Min upper bound: 40

We have 11 subsequences <C1,C2> in row 1 and 4 “C3”s (#replicates), so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 11*4 = 44.

In the Min upper bound, we multiply the #replicates of C3 by the #subsequence matches of <C1,C2>. The tightness of the Min bound is therefore also determined by the #replicates.
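The arithmetic behind the value 40 can be sketched as below. The tail-side product (11*4 = 44) is stated on the slide; the head-side product (10 Head matches times 4 “C1” replicates = 40) and taking the minimum of the two are inferred from the numbers of the running example, and the function name is hypothetical.

```python
def min_upper_bound(tail_matches, head_matches, r_first, r_last):
    """Min upper bound on the #subsequence matches of the new sequence."""
    via_tail = tail_matches * r_last     # extend each Tail match by any Cx
    via_head = head_matches * r_first    # prepend any C1 to each Head match
    return min(via_tail, via_head)


# Running example: 11 <C1,C2>s, 10 <C2,C3>s, 4 replicates per column.
print(min_upper_bound(11, 10, 4, 4))     # 40
```

Both products scale linearly with the replicate counts, which is why this bound loosens as #replicates grows.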


Scalability test on #replicates



The saving from the HT bounds + compression method increases as #replicates increases. This is mainly due to the saving from compressing the sequences, which reduces the #enumerated sequences.

Vary the #rows

Scalability test on #rows

Conclusion

Single microarray output is subject to substantial variability; replication is the common practice to address this issue.

We have proposed a scoring model to mine Order Preserving Submatrices from gene expression datasets with repeated measurements.

Mining OPSMs under the scoring model incurs a heavy computational cost (obtaining the #subsequence matches).

An HT bounding technique and a compression method are proposed to mine the OPSMs efficiently.

Experimental results show that the HT bounding technique + compression method achieves the best CPU cost saving.

Things not covered in this talk

Biological evaluation of cluster quality: oPOSSUM, Gene Ontology, ARI

Efficient method of the subsequence function
  Prefix tree to organize the candidates, verify the supports through a single dataset scan.
  Compression on the sequence dataset, reduce the #prefix tree traversal.
  Bounding techniques

Application in other areas: Collaborative Filtering

Visualization of OPSMs

End

Thank you!

Date post:	03-Jan-2016
Category:	Documents
Upload:	arron-wright